Preface


This book is intended to have three roles and to serve three associated audiences: an 
introductory text on Bayesian inference starting from first principles, a graduate text on 
effective current approaches to Bayesian modeling and computation in statistics and related 
fields, and a handbook of Bayesian methods in applied statistics for general users of and 
researchers in applied statistics. Although introductory in its early sections, the book is 
definitely not elementary in the sense of a first text in statistics. The mathematics used 
in our book is basic probability and statistics, elementary calculus, and linear algebra. A 
review of probability notation is given in Chapter 1 along with a more detailed list of topics 
assumed to have been studied. The practical orientation of the book means that the reader’s 
previous experience in probability, statistics, and linear algebra should ideally have included 
strong computational components. 

To write an introductory text alone would leave many readers with only a taste of the 
conceptual elements but no guidance for venturing into genuine practical applications, be- 
yond those where Bayesian methods agree essentially with standard non-Bayesian analyses. 
On the other hand, we feel it would be a mistake to present the advanced methods with- 
out first introducing the basic concepts from our data-analytic perspective. Furthermore, 
due to the nature of applied statistics, a text on current Bayesian methodology would be 
incomplete without a variety of worked examples drawn from real applications. To avoid 
cluttering the main narrative, there are bibliographic notes at the end of each chapter and 
references at the end of the book. 

Examples of real statistical analyses appear throughout the book, and we hope thereby 
to give an applied flavor to the entire development. Indeed, given the conceptual simplicity 
of the Bayesian approach, it is only in the intricacy of specific applications that novelty 
arises. Non-Bayesian approaches dominated statistical theory and practice for most of the 
last century, but the last few decades have seen a re-emergence of Bayesian methods. This 
has been driven more by the availability of new computational techniques than by what 
many would see as the theoretical and logical advantages of Bayesian thinking. 

In our treatment of Bayesian inference, we focus on practice rather than philosophy. We 
demonstrate our attitudes via examples that have arisen in the applied research of ourselves 
and others. Chapter 1 presents our views on the foundations of probability as empirical 
and measurable; see in particular Sections 1.4-1.7. 


Changes for the third edition 


The biggest change for this new edition is the addition of Chapters 20-23 on nonparametric 
modeling. Other major changes include weakly informative priors in Chapters 2, 5, and 
elsewhere; boundary-avoiding priors in Chapter 13; an updated discussion of cross-validation 
and predictive information criteria in the new Chapter 7; improved convergence monitoring 
and effective sample size calculations for iterative simulation in Chapter 11; presentations of 
Hamiltonian Monte Carlo, variational Bayes, and expectation propagation in Chapters 12 
and 13; and new and revised code in Appendix C. We have made other changes throughout. 

During the eighteen years since completing the first edition of Bayesian Data Analysis, 
we have worked on dozens of interesting applications which, for reasons of space, we are not 
able to add to this new edition. Many of these examples appear in our book, Data Analysis
Using Regression and Hierarchical/Multilevel Models, as well as in our published research
articles.

This electronic edition is for non-commercial purposes only.

We have made some small corrections and updates for the second printing of the third 
edition. 


Online information 


Additional materials, including the data used in the examples, solutions to many of the 
end-of-chapter exercises, and any errors found after the book goes to press, are posted at 
http://www.stat.columbia.edu/~gelman/book/. Feel free to send any comments to us 
directly. 
Part I: Fundamentals of Bayesian Inference


Bayesian inference is the process of fitting a probability model to a set of data and sum- 
marizing the result by a probability distribution on the parameters of the model and on 
unobserved quantities such as predictions for new observations. In Chapters 1-3, we in- 
troduce several useful families of models and illustrate their application in the analysis of 
relatively simple data structures. Some mathematics arises in the analytical manipulation of 
the probability distributions, notably in transformation and integration in multiparameter 
problems. We differ somewhat from other introductions to Bayesian inference by emphasiz- 
ing stochastic simulation, and the combination of mathematical analysis and simulation, as 
general methods for summarizing distributions. Chapter 4 outlines the fundamental con- 
nections between Bayesian and other approaches to statistical inference. The early chapters 
focus on simple examples to develop the basic ideas of Bayesian inference; examples in which 
the Bayesian approach makes a practical difference relative to more traditional approaches 
begin to appear in Chapter 3. The major practical advantages of the Bayesian approach 
appear in Chapter 5, where we introduce hierarchical models, which allow the parameters 
of a prior, or population, distribution themselves to be estimated from data. 
Chapter 1


Probability and inference 


1.1 The three steps of Bayesian data analysis 


This book is concerned with practical methods for making inferences from data using prob- 
ability models for quantities we observe and for quantities about which we wish to learn. 
The essential characteristic of Bayesian methods is their explicit use of probability for quan- 
tifying uncertainty in inferences based on statistical data analysis. 

The process of Bayesian data analysis can be idealized by dividing it into the following 
three steps: 


1. Setting up a full probability model—a joint probability distribution for all observable and 
unobservable quantities in a problem. The model should be consistent with knowledge 
about the underlying scientific problem and the data collection process. 


2. Conditioning on observed data: calculating and interpreting the appropriate posterior 
distribution—the conditional probability distribution of the unobserved quantities of ul- 
timate interest, given the observed data. 


3. Evaluating the fit of the model and the implications of the resulting posterior distribution: 
how well does the model fit the data, are the substantive conclusions reasonable, and 
how sensitive are the results to the modeling assumptions in step 1? In response, one 
can alter or expand the model and repeat the three steps. 


Great advances in all these areas have been made in the last forty years, and many 
of these are reviewed and used in examples throughout the book. Our treatment covers 
all three steps, the second involving computational methodology and the third a delicate 
balance of technique and judgment, guided by the applied context of the problem. The first 
step remains a major stumbling block for much Bayesian analysis: just where do our models 
come from? How do we go about constructing appropriate probability specifications? We 
provide some guidance on these issues and illustrate the importance of the third step in 
retrospectively evaluating the fit of models. Along with the improved techniques available 
for computing conditional probability distributions in the second step, advances in carrying 
out the third step alleviate to some degree the need to assume correct model specification at 
the first attempt. In particular, the much-feared dependence of conclusions on ‘subjective’ 
prior distributions can be examined and explored. 

A primary motivation for Bayesian thinking is that it facilitates a common-sense in- 
terpretation of statistical conclusions. For instance, a Bayesian (probability) interval for 
an unknown quantity of interest can be directly regarded as having a high probability of 
containing the unknown quantity, in contrast to a frequentist (confidence) interval, which 
may strictly be interpreted only in relation to a sequence of similar inferences that might 
be made in repeated practice. Recently in applied statistics, increased emphasis has been 
placed on interval estimation rather than hypothesis testing, and this provides a strong im- 
petus to the Bayesian viewpoint, since it seems likely that most users of standard confidence 
intervals give them a common-sense Bayesian interpretation. One of our aims in this book 
is to indicate the extent to which Bayesian interpretations of common simple statistical 
procedures are justified. 

Rather than argue the foundations of statistics—see the bibliographic note at the end 
of this chapter for references to foundational debates—we prefer to concentrate on the 
pragmatic advantages of the Bayesian framework, whose flexibility and generality allow 
it to cope with complex problems. The central feature of Bayesian inference, the direct 
quantification of uncertainty, means that there is no impediment in principle to fitting 
models with many parameters and complicated multilayered probability specifications. In 
practice, the problems are ones of setting up and computing with such large models, and 
a large part of this book focuses on recently developed and still developing techniques 
for handling these modeling and computational challenges. The freedom to set up complex 
models arises in large part from the fact that the Bayesian paradigm provides a conceptually 
simple method for coping with multiple parameters, as we discuss in detail from Chapter 3 
on. 


1.2 General notation for statistical inference 


Statistical inference is concerned with drawing conclusions, from numerical data, about 
quantities that are not observed. For example, a clinical trial of a new cancer drug might 
be designed to compare the five-year survival probability in a population given the new drug 
to that in a population under standard treatment. These survival probabilities refer to a 
large population of patients, and it is neither feasible nor ethically acceptable to experiment 
on an entire population. Therefore inferences about the true probabilities and, in particular, 
their differences must be based on a sample of patients. In this example, even if it were 
possible to expose the entire population to one or the other treatment, it is never possible to 
expose anyone to both treatments, and therefore statistical inference would still be needed to 
assess the causal inference—the comparison between the observed outcome in each patient 
and that patient’s unobserved outcome if exposed to the other treatment. 

We distinguish between two kinds of estimands—unobserved quantities for which sta- 
tistical inferences are made—first, potentially observable quantities, such as future obser- 
vations of a process, or the outcome under the treatment not received in the clinical trial 
example; and second, quantities that are not directly observable, that is, parameters that 
govern the hypothetical process leading to the observed data (for example, regression coef- 
ficients). The distinction between these two kinds of estimands is not always precise, but is 
generally useful as a way of understanding how a statistical model for a particular problem 
fits into the real world. 


Parameters, data, and predictions 


As general notation, we let θ denote unobservable vector quantities or population parameters
of interest (such as the probabilities of survival under each treatment for randomly chosen
members of the population in the example of the clinical trial), y denote the observed
data (such as the numbers of survivors and deaths in each treatment group), and ỹ denote
unknown, but potentially observable, quantities (such as the outcomes of the patients under
the other treatment, or the outcome under each of the treatments for a new patient similar
to those already in the trial). In general these symbols represent multivariate quantities.
We generally use Greek letters for parameters, lower case Roman letters for observed or
observable scalars and vectors (and sometimes matrices), and upper case Roman letters
for observed or observable matrices. When using matrix notation, we consider vectors as
column vectors throughout; for example, if u is a vector with n components, then u^T u is a
scalar and u u^T an n × n matrix.


Observational units and variables


In many statistical studies, data are gathered on each of a set of n objects or units, and 
we can write the data as a vector, y = (y_1, ..., y_n). In the clinical trial example, we might
label y_i as 1 if patient i is alive after five years or 0 if the patient dies. If several variables
are measured on each unit, then each y_i is actually a vector, and the entire dataset y is a
matrix (usually taken to have n rows). The y variables are called the ‘outcomes’ and are 
considered ‘random’ in the sense that, when making inferences, we wish to allow for the 
possibility that the observed values of the variables could have turned out otherwise, due 
to the sampling process and the natural variation of the population. 


Exchangeability 


The usual starting point of a statistical analysis is the (often tacit) assumption that the 
n values y_i may be regarded as exchangeable, meaning that we express uncertainty as a
joint probability density p(y_1, ..., y_n) that is invariant to permutations of the indexes. A
nonexchangeable model would be appropriate if information relevant to the outcome were 
conveyed in the unit indexes rather than by explanatory variables (see below). The idea of 
exchangeability is fundamental to statistics, and we return to it repeatedly throughout the 
book. 

We commonly model data from an exchangeable distribution as independently and identically
distributed (iid) given some unknown parameter vector θ with distribution p(θ). In
the clinical trial example, we might model the outcomes y_i as iid, given θ, the unknown
probability of survival.


Explanatory variables 


It is common to have observations on each unit that we do not bother to model as random. 
In the clinical trial example, such variables might include the age and previous health status 
of each patient in the study. We call this second class of variables explanatory variables, or 
covariates, and label them x. We use X to denote the entire set of explanatory variables 
for all n units; if there are k explanatory variables, then X is a matrix with n rows and k 
columns. Treating X as random, the notion of exchangeability can be extended to require 
the distribution of the n values of (x, y)_i to be unchanged by arbitrary permutations of
the indexes. It is always appropriate to assume an exchangeable model after incorporating
sufficient relevant information in X that the indexes can be thought of as randomly assigned.
It follows from the assumption of exchangeability that the distribution of y, given x, is the
same for all units in the study in the sense that if two units have the same value of x, then 
their distributions of y are the same. Any of the explanatory variables x can be moved into 
the y category if we wish to model them. We discuss the role of explanatory variables (also 
called predictors) in detail in Chapter 8 in the context of analyzing surveys, experiments, 
and observational studies, and in the later parts of this book in the context of regression 
models. 


Hierarchical modeling 


In Chapter 5 and subsequent chapters, we focus on hierarchical models (also called mul- 
tilevel models), which are used when information is available on several different levels of 
observational units. In a hierarchical model, it is possible to speak of exchangeability at 
each level of units. For example, suppose two medical treatments are applied, in separate 
randomized experiments, to patients in several different cities. Then, if no other information 
were available, it would be reasonable to treat the patients within each city as exchangeable 
and also treat the results from different cities as themselves exchangeable. In practice it 
would make sense to include, as explanatory variables at the city level, whatever relevant 
information we have on each city, as well as the explanatory variables mentioned before at 
the individual level, and then the conditional distributions given these explanatory variables 
would be exchangeable. 


1.3 Bayesian inference 


Bayesian statistical conclusions about a parameter θ, or unobserved data ỹ, are made in
terms of probability statements. These probability statements are conditional on the observed
value of y, and in our notation are written simply as p(θ|y) or p(ỹ|y). We also
implicitly condition on the known values of any covariates, x. It is at the fundamental
level of conditioning on observed data that Bayesian inference departs from the approach
to statistical inference described in many textbooks, which is based on a retrospective evaluation
of the procedure used to estimate θ (or ỹ) over the distribution of possible y values
conditional on the true unknown value of θ. Despite this difference, it will be seen that
in many simple analyses, superficially similar conclusions result from the two approaches 
to statistical inference. However, analyses obtained using Bayesian methods can be easily 
extended to more complex problems. In this section, we present the basic mathematics and 
notation of Bayesian inference, followed in the next section by an example from genetics. 


Probability notation 

Some comments on notation are needed at this point. First, p(·|·) denotes a conditional
probability density with the arguments determined by the context, and similarly for
p(·), which denotes a marginal distribution. We use the terms ‘distribution’ and
‘density’ interchangeably. The same notation is used for continuous density functions
and discrete probability mass functions. Different distributions in the same equation
(or expression) will each be denoted by p(·), as in (1.1) below, for example. Although
an abuse of standard mathematical notation, this method is compact and similar to
the standard practice of using p(·) for the probability of any discrete event, where
the sample space is also suppressed in the notation. Depending on context, to avoid
confusion, we may use the notation Pr(·) for the probability of an event; for example,
Pr(θ > 2) = ∫_{θ>2} p(θ) dθ. When using a standard distribution, we use a notation based
on the name of the distribution; for example, if θ has a normal distribution with mean
μ and variance σ², we write θ ~ N(μ, σ²) or p(θ) = N(θ|μ, σ²) or, to be even more
explicit, p(θ|μ, σ²) = N(θ|μ, σ²). Throughout, we use notation such as N(μ, σ²) for
random variables and N(θ|μ, σ²) for density functions. Notation and formulas for
several standard distributions appear in Appendix A.

We also occasionally use the following expressions for random variables θ: the coefficient
of variation is defined as sd(θ)/E(θ), the geometric mean is exp(E[log(θ)]), and
the geometric standard deviation is exp(sd[log(θ)]).
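
When a distribution is represented by simulation draws, these three summaries can be estimated directly from the draws. The following is a minimal sketch rather than anything taken from the text: the function names and the lognormal test distribution are illustrative assumptions, and the geometric summaries require all draws to be positive.

```python
import math
import random
import statistics


def coefficient_of_variation(draws):
    """Estimate sd(theta) / E(theta) from a list of draws."""
    return statistics.pstdev(draws) / statistics.fmean(draws)


def geometric_mean(draws):
    """Estimate exp(E[log(theta)]); all draws must be positive."""
    return math.exp(statistics.fmean([math.log(d) for d in draws]))


def geometric_sd(draws):
    """Estimate exp(sd[log(theta)]); all draws must be positive."""
    return math.exp(statistics.pstdev([math.log(d) for d in draws]))


# Hypothetical check: for log(theta) ~ N(0, 0.5^2), the geometric mean is
# exp(0) = 1 and the geometric standard deviation is exp(0.5).
rng = random.Random(0)
draws = [rng.lognormvariate(0.0, 0.5) for _ in range(20_000)]

cv = coefficient_of_variation(draws)
gm = geometric_mean(draws)
gsd = geometric_sd(draws)
```

With this many draws the estimates should sit close to the theoretical values; for the lognormal case the coefficient of variation is sqrt(exp(σ²) − 1), roughly 0.53 for σ = 0.5.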


Bayes’ rule 
In order to make probability statements about θ given y, we must begin with a model
providing a joint probability distribution for θ and y. The joint probability mass or density
function can be written as a product of two densities that are often referred to as the prior
distribution p(θ) and the sampling distribution (or data distribution) p(y|θ), respectively:

\[ p(\theta, y) = p(\theta)\, p(y \mid \theta). \]

Simply conditioning on the known value of the data y, using the basic property of conditional
probability known as Bayes’ rule, yields the posterior density:

\[ p(\theta \mid y) = \frac{p(\theta, y)}{p(y)} = \frac{p(\theta)\, p(y \mid \theta)}{p(y)}, \tag{1.1} \]

where p(y) = Σ_θ p(θ)p(y|θ), and the sum is over all possible values of θ (or p(y) =
∫ p(θ)p(y|θ) dθ in the case of continuous θ). An equivalent form of (1.1) omits the factor
p(y), which does not depend on θ and, with fixed y, can thus be considered a constant,
yielding the unnormalized posterior density, which is the right side of (1.2):

\[ p(\theta \mid y) \propto p(\theta)\, p(y \mid \theta). \tag{1.2} \]

The second term in this expression, p(y|θ), is taken here as a function of θ, not of y. These
simple formulas encapsulate the technical core of Bayesian inference: the primary task of
any specific application is to develop the model p(θ, y) and perform the computations to
summarize p(θ|y) in appropriate ways.
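
For a discrete parameter, (1.1) and (1.2) amount to multiplying prior by likelihood and dividing by the sum. The sketch below illustrates this under stated assumptions: the function name `discrete_posterior` and the dictionary representation are hypothetical choices, and the numbers are borrowed from the genetics example in Section 1.4.

```python
def discrete_posterior(prior, likelihood):
    """Apply Bayes' rule over a finite set of parameter values.

    prior and likelihood are dicts keyed by the possible values of theta;
    likelihood[theta] is p(y | theta) for the observed y.
    Returns the normalized posterior and the normalizing constant p(y).
    """
    # Unnormalized posterior, the right side of (1.2).
    unnormalized = {theta: prior[theta] * likelihood[theta] for theta in prior}
    # p(y) is the sum of p(theta) p(y | theta) over all values of theta.
    p_y = sum(unnormalized.values())
    posterior = {theta: value / p_y for theta, value in unnormalized.items()}
    return posterior, p_y


# Numbers from the genetics example of Section 1.4 (two unaffected sons).
prior = {"carrier": 0.5, "not carrier": 0.5}
likelihood = {"carrier": 0.25, "not carrier": 1.0}
posterior, p_y = discrete_posterior(prior, likelihood)
```

Here `p_y` is 0.625 and the posterior probability of being a carrier is 0.125/0.625 = 0.20, matching the hand calculation in Section 1.4.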


Prediction 


To make inferences about an unknown observable, often called predictive inferences, we 
follow a similar logic. Before the data y are considered, the distribution of the unknown 
but observable y is 


\[ p(y) = \int p(y, \theta)\, d\theta = \int p(\theta)\, p(y \mid \theta)\, d\theta. \tag{1.3} \]


This is often called the marginal distribution of y, but a more informative name is the prior 
predictive distribution: prior because it is not conditional on a previous observation of the 
process, and predictive because it is the distribution for a quantity that is observable. 

After the data y have been observed, we can predict an unknown observable, ỹ, from
the same process. For example, y = (y_1, ..., y_n) may be the vector of recorded weights of
an object weighed n times on a scale, θ = (μ, σ²) may be the unknown true weight of the
object and the measurement variance of the scale, and ỹ may be the yet to be recorded
weight of the object in a planned new weighing. The distribution of ỹ is called the posterior
predictive distribution, posterior because it is conditional on the observed y and predictive
because it is a prediction for an observable ỹ:

\[
\begin{aligned}
p(\tilde{y} \mid y) &= \int p(\tilde{y}, \theta \mid y)\, d\theta \\
&= \int p(\tilde{y} \mid \theta, y)\, p(\theta \mid y)\, d\theta \\
&= \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta.
\end{aligned}
\tag{1.4}
\]

The second and third lines display the posterior predictive distribution as an average of
conditional predictions over the posterior distribution of θ. The last step follows from the
assumed conditional independence of y and ỹ given θ.
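
The last step can be spelled out in one short derivation. This is a sketch that uses only the definition of conditional probability and the conditional independence assumption, which says the joint density of ỹ and y given θ factors into a product.

```latex
\begin{aligned}
p(\tilde{y} \mid \theta, y)
  &= \frac{p(\tilde{y}, y \mid \theta)}{p(y \mid \theta)} \\
  &= \frac{p(\tilde{y} \mid \theta)\, p(y \mid \theta)}{p(y \mid \theta)} \\
  &= p(\tilde{y} \mid \theta).
\end{aligned}
```

Substituting this into the second line of (1.4) gives the third.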


Likelihood 


Using Bayes’ rule with a chosen probability model means that the data y affect the posterior 
inference (1.2) only through p(y|θ), which, when regarded as a function of θ, for fixed y, is
called the likelihood function. In this way Bayesian inference is obeying what is sometimes 
called the likelihood principle. 

The likelihood principle is reasonable, but only within the framework of the model or
family of models adopted for a particular analysis. In practice, one can rarely be confident 
that the chosen model is correct. We shall see in Chapter 6 that sampling distributions 
(imagining repeated realizations of our data) can play an important role in checking model 
assumptions. In fact, our view of an applied Bayesian statistician is one who is willing to 
apply Bayes’ rule under a variety of possible models. 


Likelihood and odds ratios 


The ratio of the posterior density p(θ|y) evaluated at the points θ_1 and θ_2 under a given
model is called the posterior odds for θ_1 compared to θ_2. The most familiar application of
this concept is with discrete parameters, with θ_2 taken to be the complement of θ_1. Odds
provide an alternative representation of probabilities and have the attractive property that
Bayes’ rule takes a particularly simple form when expressed in terms of them:

\[ \frac{p(\theta_1 \mid y)}{p(\theta_2 \mid y)} = \frac{p(\theta_1)\, p(y \mid \theta_1)/p(y)}{p(\theta_2)\, p(y \mid \theta_2)/p(y)} = \frac{p(\theta_1)}{p(\theta_2)}\, \frac{p(y \mid \theta_1)}{p(y \mid \theta_2)}. \tag{1.5} \]

In words, the posterior odds are equal to the prior odds multiplied by the likelihood ratio,
p(y|θ_1)/p(y|θ_2).


1.4 Discrete examples: genetics and spell checking 


We next demonstrate Bayes’ theorem with two examples in which the immediate goal is 
inference about a particular discrete quantity rather than the estimation of a parameter
that describes an entire population. These discrete examples allow us to see the prior, 
likelihood, and posterior probabilities directly. 


Inference about a genetic status 


Human males have one X-chromosome and one Y-chromosome, whereas females have two 
X-chromosomes, each chromosome being inherited from one parent. Hemophilia is a disease 
that exhibits X-chromosome-linked recessive inheritance, meaning that a male who inherits 
the gene that causes the disease on the X-chromosome is affected, whereas a female carrying 
the gene on only one of her two X-chromosomes is not affected. The disease is generally fatal 
for women who inherit two such genes, and this is rare, since the frequency of occurrence 
of the gene is low in human populations. 


Prior distribution. Consider a woman who has an affected brother, which implies that her 
mother must be a carrier of the hemophilia gene with one ‘good’ and one ‘bad’ hemophilia 
gene. We are also told that her father is not affected; thus the woman herself has a fifty-fifty 
chance of having the gene. The unknown quantity of interest, the state of the woman, has 
just two values: the woman is either a carrier of the gene (θ = 1) or not (θ = 0). Based on
the information provided thus far, the prior distribution for the unknown θ can be expressed
simply as Pr(θ = 1) = Pr(θ = 0) = 1/2.


Data model and likelihood. The data used to update the prior information consist of the
affection status of the woman’s sons. Suppose she has two sons, neither of whom is affected.
Let y_i = 1 or 0 denote an affected or unaffected son, respectively. The outcomes of the two
sons are exchangeable and, conditional on the unknown θ, are independent; we assume the
sons are not identical twins. The two items of independent data generate the following
likelihood function:

\[
\begin{aligned}
\Pr(y_1 = 0, y_2 = 0 \mid \theta = 1) &= (0.5)(0.5) = 0.25 \\
\Pr(y_1 = 0, y_2 = 0 \mid \theta = 0) &= (1)(1) = 1.
\end{aligned}
\]

These expressions follow from the fact that if the woman is a carrier, then each of her sons
will have a 50% chance of inheriting the gene and so being affected, whereas if she is not a 
carrier then there is a probability close to 1 that a son of hers will be unaffected. (In fact, 
there is a nonzero probability of being affected even if the mother is not a carrier, but this 
risk—the mutation rate—is small and can be ignored for this example.) 


Posterior distribution. Bayes’ rule can now be used to combine the information in the 
data with the prior probability; in particular, interest is likely to focus on the posterior 
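probability that the woman is a carrier. Using y to denote the joint data (y_1, y_2), this is
simply

\[
\begin{aligned}
\Pr(\theta = 1 \mid y) &= \frac{p(y \mid \theta = 1)\Pr(\theta = 1)}{p(y \mid \theta = 1)\Pr(\theta = 1) + p(y \mid \theta = 0)\Pr(\theta = 0)} \\
&= \frac{(0.25)(0.5)}{(0.25)(0.5) + (1.0)(0.5)} = \frac{0.125}{0.625} = 0.20.
\end{aligned}
\]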


Intuitively it is clear that if a woman has unaffected children, it is less probable that she is 
a carrier, and Bayes’ rule provides a formal mechanism for determining the extent of the 
correction. The results can also be described in terms of prior and posterior odds. The 
prior odds of the woman being a carrier are 0.5/0.5 = 1. The likelihood ratio based on 
the information about her two unaffected sons is 0.25/1 = 0.25, so the posterior odds are 
1 × 0.25 = 0.25. Converting back to a probability, we obtain 0.25/(1 + 0.25) = 0.2, as before.
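
The odds calculation can be checked in a few lines. This is only an illustrative sketch; the helper name `odds_to_probability` is a hypothetical choice, and the numbers are those given in the text.

```python
def odds_to_probability(odds):
    """Convert odds of an event into its probability: odds / (1 + odds)."""
    return odds / (1.0 + odds)


# Prior odds of being a carrier: Pr(theta = 1) / Pr(theta = 0).
prior_odds = 0.5 / 0.5
# Likelihood ratio for two unaffected sons: p(y | theta = 1) / p(y | theta = 0).
likelihood_ratio = 0.25 / 1.0
# Posterior odds are the prior odds times the likelihood ratio, as in (1.5).
posterior_odds = prior_odds * likelihood_ratio
posterior_probability = odds_to_probability(posterior_odds)
```

The result agrees with the direct application of Bayes' rule: posterior odds of 0.25 correspond to a probability of 0.2.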


Adding more data. A key aspect of Bayesian analysis is the ease with which sequential 
analyses can be performed. For example, suppose that the woman has a third son, who 
is also unaffected. The entire calculation does not need to be redone; rather we use the 
previous posterior distribution as the new prior distribution, to obtain: 


\[ \Pr(\theta = 1 \mid y_1, y_2, y_3) = \frac{(0.5)(0.20)}{(0.5)(0.20) + (1)(0.8)} = 0.111. \]

Alternatively, if we suppose that the third son is affected, it is easy to check that the
posterior probability of the woman being a carrier becomes 1 (again ignoring the possibility 
of a mutation). 
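
Sequential updating can be sketched as a loop in which each posterior becomes the prior for the next observation. The function name `update_carrier_probability` is a hypothetical choice, and the sketch assumes, as in the text, that a carrier's son is affected with probability 0.5 and that mutation is ignored.

```python
def update_carrier_probability(prior_carrier, son_affected):
    """Apply Bayes' rule once, for the status of a single son.

    Assumes a carrier's son is affected with probability 0.5 and that a
    non-carrier's son is never affected (mutation ignored).
    """
    like_carrier = 0.5  # same value whether the son is affected or not
    like_noncarrier = 0.0 if son_affected else 1.0
    numerator = like_carrier * prior_carrier
    denominator = numerator + like_noncarrier * (1.0 - prior_carrier)
    return numerator / denominator


# Start from the prior Pr(theta = 1) = 0.5 and process three unaffected sons,
# using each posterior as the prior for the next son.
probability = 0.5
history = []
for son_affected in [False, False, False]:
    probability = update_carrier_probability(probability, son_affected)
    history.append(probability)

# If instead the third son is affected, starting from the posterior of 0.20:
affected_case = update_carrier_probability(0.20, True)
```

After the second son `history` reaches 0.20 and after the third it reaches 1/9, about 0.111, the same values as in the text. The affected-son case gives 1, because under the no-mutation assumption only a carrier can have an affected son.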


Spelling correction 


Classification of words is a problem of managing uncertainty. For example, suppose someone 
types ‘radom.’ How should that be read? It could be a misspelling or mistyping of ‘random’ 
or ‘radon’ or some other alternative, or it could be the intentional typing of ‘radom’ (as 
in its first use in this paragraph). What is the probability that ‘radom’ actually means 
random? If we label y as the data and θ as the word that the person was intending to type,
then

\[ \Pr(\theta \mid y = \text{`radom'}) \propto p(\theta)\, \Pr(y = \text{`radom'} \mid \theta). \tag{1.6} \]

This product is the unnormalized posterior density. In this case, if for simplicity we consider
only three possibilities for the intended word, θ (random, radon, or radom), we can compute
the posterior probability of interest by first computing the unnormalized density for all three
values of θ and then normalizing:

\[ p(\text{random} \mid \text{`radom'}) = \frac{p(\theta_1)\, p(\text{`radom'} \mid \theta_1)}{\sum_{j=1}^{3} p(\theta_j)\, p(\text{`radom'} \mid \theta_j)}, \]

where θ_1 = random, θ_2 = radon, and θ_3 = radom. The prior probabilities p(θ_j) can most simply
come from frequencies of these words in some large database, ideally one that is adapted
to the problem at hand (for example, a database of recent student emails if the word in
question is appearing in such a document). The likelihoods p(y|θ_j) can come from some
modeling of spelling and typing errors, perhaps fit using some study in which people were 
followed up after writing emails to identify any questionable words. 


Prior distribution. Without any other context, it makes sense to assign the prior probabilities
p(θ_j) based on the relative frequencies of these three words in some databases. Here
are probabilities supplied by researchers at Google:

θ         p(θ)
random    7.60 × 10^-5
radon     6.05 × 10^-6
radom     3.12 × 10^-7


Since we are considering only these possibilities, we could renormalize the three numbers to
sum to 1 (p(random) = 760/(760 + 60.5 + 3.12), etc.) but there is no need, as the adjustment would
merely be absorbed into the proportionality constant in (1.6).

Returning to the table above, we were surprised to see the probability of ‘radom’ in the 
corpus being as high as it was. We looked up the word in Wikipedia and found that it is a 
medium-sized city: home to ‘the largest and best-attended air show in Poland ...also the 
popular unofficial name for a semiautomatic 9 mm Para pistol of Polish design ...’ For 
the documents that we encounter, the relative probability of ‘radom’ seems much too high. 
If the probabilities above do not seem appropriate for our application, this implies that we 
have prior information or beliefs that have not yet been included in the model. We shall 
return to this point after first working out the model’s implications for this example. 


Likelihood. Here are some conditional probabilities from Google’s model of spelling and
typing errors:

θ         p(‘radom’|θ)
random    0.00193
radon     0.000143
radom     0.975

We emphasize that this likelihood function is not a probability distribution. Rather, it is a
set of conditional probabilities of a particular outcome (‘radom’) from three different probability
distributions, corresponding to three different possibilities for the unknown parameter
θ.

These particular values look reasonable enough—a 97% chance that this particular five- 
letter word will be typed correctly, a 0.2% chance of obtaining this character string by 
mistakenly dropping a letter from ‘random,’ and a much lower chance of obtaining it by 
mistyping the final letter of ‘radon.’ We have no strong intuition about these probabilities 
and will trust the Google engineers here. 


Posterior distribution. We multiply the prior probability and the likelihood to get joint 
probabilities and then renormalize to get posterior probabilities: 


    θ         p(θ) p(‘radom’|θ)    p(θ|‘radom’)

    random    1.47 × 10^-7         0.325
    radon     8.65 × 10^-10        0.002
    radom     3.04 × 10^-7         0.673


Thus, conditional on the model, the typed word ‘radom’ is about twice as likely to be correct 
as to be a typographical error for ‘random,’ and it is very unlikely to be a mistaken instance 
of ‘radon.’ A fuller analysis would include possibilities beyond these three words, but the 
basic idea is the same. 
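
The arithmetic behind the posterior table can be checked with a few lines of Python. This is only a sketch: the names prior, likelihood, and posterior are our own illustrative choices, and the numbers are the prior and likelihood values quoted in the text.

```python
# Posterior probabilities for the intended word, given the typed string 'radom'.
# Variable and function names are illustrative; the numbers come from the text.

prior = {"random": 7.60e-5, "radon": 6.05e-6, "radom": 3.12e-7}
likelihood = {"random": 0.00193, "radon": 0.000143, "radom": 0.975}


def posterior(prior, likelihood):
    """Multiply prior by likelihood for each candidate word, then renormalize."""
    joint = {w: prior[w] * likelihood[w] for w in prior}
    total = sum(joint.values())
    return {w: joint[w] / total for w in joint}


post = posterior(prior, likelihood)
for word, prob in post.items():
    print(f"{word:8s} {prob:.3f}")

# Renormalizing the prior first changes nothing: the constant cancels
# when the joint probabilities are renormalized.
prior_sum = sum(prior.values())
normalized_prior = {w: p / prior_sum for w, p in prior.items()}
post_normalized = posterior(normalized_prior, likelihood)
```

The second call illustrates why the three prior values need not sum to 1: any constant factor in the prior is absorbed when the joint probabilities are renormalized.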


Decision making, model checking, and model improvement. We can envision two directions 
to go from here. The first approach is to accept the two-thirds probability that the word 
was typed correctly or even to simply declare ‘radom’ as correct on first pass. The second
option would be to question this probability by saying, for example, that ‘radom’ looks like 
a typo and that the estimated probability of it being correct seems much too high. 

When we dispute the claims of a posterior distribution, we are saying that the model 
does not fit the data or that we have additional prior information not included in the model 
so far. In this case, we are only examining one word so lack of fit is not the issue; thus a 
dispute over the posterior must correspond to a claim of additional information, either in 
the prior or the likelihood. 

For this problem we have no particular grounds on which to criticize the likelihood. The 
prior probabilities, on the other hand, are highly context dependent. The word ‘random’ is 
of course highly frequent in our own writing on statistics, ‘radon’ occurs occasionally (see 
Section 9.4), while ‘radom’ was entirely new to us. Our surprise at the high probability of 
‘radom’ represents additional knowledge relevant to our particular problem. 

The model can be elaborated most immediately by including contextual information in 
the prior probabilities. For example, if the document under study is a statistics book, then 
it becomes more likely that the person intended to type ‘random.’ If we label x as the 
contextual information used by the model, the Bayesian calculation then becomes, 


\[ p(\theta \mid x, y) \propto p(\theta \mid x)\, p(y \mid \theta, x). \]


To first approximation, we can simplify that last term to p(y|θ), so that the probability
of any particular error (that is, the probability of typing a particular string y given the
intended word θ) does not depend on context. This is not a perfect assumption but could
reduce the burden of modeling and computation.
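
As a rough sketch of how contextual information could enter, the snippet below rescales the corpus prior by context-specific weights and then applies the same context-free likelihood, so that p(θ|x, y) ∝ p(θ|x) p(y|θ). The weights are entirely hypothetical: they are not estimated from any data and are not part of Google's model. They serve only to show the mechanics.

```python
# Context-dependent prior p(theta | x) combined with a likelihood p(y | theta)
# that is assumed not to depend on context. The context weights are HYPOTHETICAL.

corpus_prior = {"random": 7.60e-5, "radon": 6.05e-6, "radom": 3.12e-7}
likelihood = {"random": 0.00193, "radon": 0.000143, "radom": 0.975}

# Hypothetical multipliers for the context x = "statistics book":
# 'random' is made more plausible, 'radom' less so.
context_weight = {"random": 10.0, "radon": 1.0, "radom": 0.1}


def posterior(prior, likelihood):
    """Multiply prior by likelihood for each candidate word, then renormalize."""
    joint = {w: prior[w] * likelihood[w] for w in prior}
    total = sum(joint.values())
    return {w: joint[w] / total for w in joint}


context_prior = {w: corpus_prior[w] * context_weight[w] for w in corpus_prior}
post_no_context = posterior(corpus_prior, likelihood)
post_context = posterior(context_prior, likelihood)

for w in corpus_prior:
    print(f"{w:8s} {post_no_context[w]:.3f} -> {post_context[w]:.3f}")
```

Under these made-up weights most of the posterior mass moves to ‘random’; different weights would of course give different answers, which is exactly why the prior needs to be modeled from data.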

The practical challenges in Bayesian inference involve setting up models to estimate all 
these probabilities from data. At that point, as shown above, Bayes’ rule can be easily 
applied to determine the implications of the model for the problem at hand. 


1.5 Probability as a measure of uncertainty 


We have already used concepts such as probability density, and indeed we assume that the 
reader has a fair degree of familiarity with basic probability theory (although in Section 
1.8 we provide a brief technical review of some probability calculations that often arise 
in Bayesian analysis). But since the uses of probability within a Bayesian framework are 
much broader than within non-Bayesian statistics, it is important to consider at least briefly 
the foundations of the concept of probability before considering more detailed statistical 
examples. We take for granted a common understanding on the part of the reader of the 
mathematical definition of probability: that probabilities are numerical quantities, defined 
on a set of ‘outcomes,’ that are nonnegative, additive over mutually exclusive outcomes, 
and sum to 1 over all possible mutually exclusive outcomes. 

In Bayesian statistics, probability is used as the fundamental measure or yardstick of 
uncertainty. Within this paradigm, it is equally legitimate to discuss the probability of 
‘rain tomorrow’ or of a Brazilian victory in the soccer World Cup as it is to discuss the 
probability that a coin toss will land heads. Hence, it becomes as natural to consider the 
probability that an unknown estimand lies in a particular range of values as it is to consider 
the probability that the mean of a random sample of 10 items from a known fixed population 
of size 100 will lie in a certain range. The first of these two probabilities is of more interest 
after data have been acquired whereas the second is more relevant beforehand. Bayesian 
methods enable statements to be made about the partial knowledge available (based on 
data) concerning some situation or ‘state of nature’ (unobservable or as yet unobserved) in 
a systematic way, using probability as the yardstick. The guiding principle is that the state 
of knowledge about anything unknown is described by a probability distribution. 

What is meant by a numerical measure of uncertainty? For example, the probability of
‘heads’ in a coin toss is widely agreed to be 1/2. Why is this so? Two justifications seem to
be commonly given:


1. Symmetry or exchangeability argument: 


\[ \text{probability} = \frac{\text{number of favorable cases}}{\text{number of possibilities}} \]

assuming equally likely possibilities. For a coin toss this is really a physical argument, 
based on assumptions about the forces at work in determining the manner in which the 
coin will fall, as well as the initial physical conditions of the toss. 


2. Frequency argument: probability = relative frequency obtained in a long sequence of 
tosses, assumed to be performed in an identical manner, physically independently of 
each other. 


Both the above arguments are in a sense subjective, in that they require judgments about
the nature of the coin and the tossing procedure, and both involve semantic arguments
about the meaning of equally likely events, identical measurements, and independence.
The frequency argument may be perceived to have certain special difficulties, in that it
involves the hypothetical notion of a long sequence of identical tosses. If taken strictly, this
point of view does not allow a statement of probability for a single coin toss that does not
happen to be embedded, at least conceptually, in a long sequence of identical events.
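
The frequency argument can be mimicked on a computer. The sketch below simulates long sequences of tosses and reports the relative frequency of heads. Note that the simulation builds in the very assumptions under discussion (identical, independent tosses with probability 1/2 of heads), so it illustrates the argument and does not justify it. The function name and the seed are our own choices.

```python
import random


def relative_frequency_of_heads(n_tosses, seed=0):
    """Simulate n_tosses independent fair coin tosses; return the fraction of heads."""
    rng = random.Random(seed)  # seeded so that the run is reproducible
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


# The relative frequency settles near 1/2 only as the sequence gets long.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```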

The following examples illustrate how probability judgments can be increasingly subjec- 
tive. First, consider the following modified coin experiment. Suppose that a particular coin 
is stated to be either double-headed or double-tailed, with no further information provided. 
Can one still talk of the probability of heads? It seems clear that in common parlance one 
certainly can. It is less clear, perhaps, how to assess this new probability, but many would 
agree on the same value of 1/2, perhaps based on the exchangeability of the labels ‘heads’
and ‘tails.’ 

Now consider some further examples. Suppose Colombia plays Brazil in soccer to- 
morrow: what is the probability of Colombia winning? What is the probability of rain 
tomorrow? What is the probability that Colombia wins, if it rains tomorrow? What is 
the probability that a specified rocket launch will fail? Although each of these questions 
seems reasonable in a common-sense way, it is difficult to contemplate strong frequency 
interpretations for the probabilities being referenced. Frequency interpretations can usually 
be constructed, however, and this is an extremely useful tool in statistics. For example, one 
can consider the future rocket launch as a sample from the population of potential launches 
of the same type, and look at the frequency of past launches that have failed (see the bib- 
liographic note at the end of this chapter for more details on this example). Doing this 
sort of thing scientifically means creating a probability model (or, at least, a ‘reference set’ 
of comparable events), and this brings us back to a situation analogous to the simple coin 
toss, where we must consider the outcomes in question as exchangeable and thus equally 
likely. 

Why is probability a reasonable way of quantifying uncertainty? The following reasons 
are often advanced. 

1. By analogy: physical randomness induces uncertainty, so it seems reasonable to describe 
uncertainty in the language of random events. Common speech uses many terms such 
as ‘probably’ and ‘unlikely,’ and it appears consistent with such usage to extend a more 
formal probability calculus to problems of scientific inference. 


2. Axiomatic or normative approach: related to decision theory, this approach places all sta- 
tistical inference in the context of decision-making with gains and losses. Then reasonable 
axioms (ordering, transitivity, and so on) imply that uncertainty must be represented in 
terms of probability. We view this normative rationale as suggestive but not compelling. 

3. Coherence of bets. Define the probability p attached (by you) to an event E as the
fraction (p ∈ [0,1]) at which you would exchange (that is, bet) $p for a return of $1 if E
occurs. That is, if E occurs, you gain $(1 − p); if the complement of E occurs, you lose
$p. For example:


• Coin toss: thinking of the coin toss as a fair bet suggests even odds corresponding to
p = 1/2.


• Odds for a game: if you are willing to bet on team A to win a game at 10 to 1 odds
against team B (that is, you bet 1 to win 10), your ‘probability’ for team A winning
is at least 1/11.


The principle of coherence states that your assignment of probabilities to all possible 
events should be such that it is not possible to make a definite gain by betting with you. 
It can be proved that probabilities constructed under this principle must satisfy the basic 
axioms of probability theory. 

The betting rationale has some fundamental difficulties: 


• Exact odds are required, on which you would be willing to bet in either direction, for
all events. How can you assign exact odds if you are not sure?


• If a person is willing to bet with you, and has information you do not, it might not
be wise for you to take the bet. In practice, probability is an incomplete (necessary
but not sufficient) guide to betting.
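
Two pieces of the betting rationale in item 3 can be made concrete. The sketch below first converts ‘bet 1 to win 10’ into the smallest probability at which the bet is not unfavorable, and then shows the sure loss that results when the prices attached to an event and its complement do not sum to 1. The function names are ours, and the example prices of 3/5 are hypothetical.

```python
from fractions import Fraction


def implied_probability(stake, winnings):
    """Smallest win probability q at which staking `stake` to win `winnings`
    has nonnegative expected gain: q * winnings - (1 - q) * stake >= 0."""
    return Fraction(stake, stake + winnings)


def sure_loss(p_event, p_complement):
    """Guaranteed loss, per $1 of return, when your prices for E and for
    not-E do not sum to 1 and you accept bets in either direction.

    If the prices sum to more than 1, an opponent sells you both bets;
    if they sum to less than 1, the opponent buys both from you.
    Exactly one of E and not-E occurs, so exactly $1 is paid out either way.
    """
    return abs(p_event + p_complement - 1)


print(implied_probability(1, 10))                 # the 10-to-1 example
print(sure_loss(Fraction(3, 5), Fraction(3, 5)))  # incoherent (hypothetical) prices
print(sure_loss(Fraction(1, 2), Fraction(1, 2)))  # coherent prices
```

Exact fractions are used so that the coherent case returns exactly zero, with no floating-point rounding.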


All of these considerations suggest that probabilities may be a reasonable approach to 
summarizing uncertainty in applied statistics, but the ultimate proof is in the success of the 
applications. The remaining chapters of this book demonstrate that probability provides a 
rich and flexible framework for handling uncertainty in statistical applications. 


Subjectivity and objectivity 


All statistical methods that use probability are subjective in the sense of relying on math- 
ematical idealizations of the world. Bayesian methods are sometimes said to be especially 
subjective because of their reliance on a prior distribution, but in most problems, scientific 
judgment is necessary to specify both the ‘likelihood’ and the ‘prior’ parts of the model. For 
example, linear regression models are generally at least as suspect as any prior distribution 
that might be assumed about the regression parameters. A general principle is at work 
here: whenever there is replication, in the sense of many exchangeable units observed, there 
is scope for estimating features of a probability distribution from data and thus making the 
analysis more ‘objective.’ If an experiment as a whole is replicated several times, then the 
parameters of the prior distribution can themselves be estimated from data, as discussed in 
Chapter 5. In any case, however, certain elements requiring scientific judgment will remain, 
notably the choice of data included in the analysis, the parametric forms assumed for the 
distributions, and the ways in which the model is checked. 


1.6 Example: probabilities from football point spreads 


As an example of how probabilities might be assigned using empirical data and plausible 
substantive assumptions, we consider methods of estimating the probabilities of certain 
outcomes in professional (American) football games. This is an example only of probability 
assignment, not of Bayesian inference. A number of approaches to assigning probabilities 
for football game outcomes are illustrated: making subjective assessments, using empirical 
probabilities based on observed data, and constructing a parametric probability model. 

Figure 1.1 Scatterplot of actual outcome vs. point spread for each of 672 professional football games.
The x and y coordinates are jittered by adding uniform random numbers to each point’s coordinates
(between −0.1 and 0.1 for the x coordinate; between −0.2 and 0.2 for the y coordinate) in order to
display multiple values but preserve the discrete-valued nature of each.


Football point spreads and game outcomes 


Football experts provide a point spread for every football game as a measure of the difference 
in ability between the two teams. For example, team A might be a 3.5-point favorite to 
defeat team B. The implication of this point spread is that the proposition that team A, 
the favorite, defeats team B, the underdog, by 4 or more points is considered a fair bet; in 
other words, the probability that A wins by more than 3.5 points is 1/2. If the point spread
is an integer, then the implication is that team A is as likely to win by more points than 
the point spread as it is to win by fewer points than the point spread (or to lose); there is 
positive probability that A will win by exactly the point spread, in which case neither side 
is paid off. The assignment of point spreads is itself an interesting exercise in probabilistic 
reasoning; one interpretation is that the point spread is the median of the distribution of 
the gambling population’s beliefs about the possible outcomes of the game. For the rest 
of this example, we treat point spreads as given and do not worry about how they were 
derived. 


The point spread and actual game outcome for 672 professional football games played 
during the 1981, 1983, and 1984 seasons are graphed in Figure 1.1. (Much of the 1982 
season was canceled due to a labor dispute.) Each point in the scatterplot displays the 
point spread, x, and the actual outcome (favorite’s score minus underdog’s score), y. (In 
games with a point spread of zero, the labels ‘favorite’ and ‘underdog’ were assigned at 
random.) A small random jitter is added to the x and y coordinate of each point on the 
graph so that multiple points do not fall exactly on top of each other. 
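
The jittering used for the scatterplot is easy to sketch. The code below adds independent uniform noise to each coordinate, using the half-widths stated in the caption of Figure 1.1 (0.1 for x and 0.2 for y). The function name, the seed, and the example games are hypothetical and are not taken from the football dataset.

```python
import random


def jitter(points, x_width=0.1, y_width=0.2, seed=0):
    """Return a copy of `points` with independent uniform noise added to each
    coordinate, so that identical (x, y) pairs no longer overlap when plotted."""
    rng = random.Random(seed)  # seeded so that the plot is reproducible
    return [
        (x + rng.uniform(-x_width, x_width), y + rng.uniform(-y_width, y_width))
        for x, y in points
    ]


# Hypothetical (point spread, outcome) pairs, including an exact duplicate.
games = [(3.5, 7), (3.5, 7), (9.0, -3), (0.0, 10)]
jittered = jitter(games)
for original, moved in zip(games, jittered):
    print(original, "->", moved)
```

Because point spreads move in steps of 0.5 and outcomes in steps of 1, noise of this size separates coincident points while keeping the discrete grid visible.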


Assigning probabilities based on observed frequencies 


It is of interest to assign probabilities to particular events: Pr(favorite wins), Pr(favorite 
wins | point spread is 3.5 points), Pr(favorite wins by more than the point spread), Pr(favorite 
wins by more than the point spread | point spread is 3.5 points), and so forth. We might 
report a subjective probability based on informal experience gathered by reading the news- 
paper and watching football games. The probability that the favored team wins a game 
should certainly be greater than 0.5, perhaps between 0.6 and 0.75? More complex events 
require more intuition or knowledge on our part. A more systematic approach is to assign 
probabilities based on the data in Figure 1.1. Counting a tied game as one-half win and
one-half loss, and ignoring games for which the point spread is zero (and thus there is no
favorite), we obtain empirical estimates such as:

• Pr(favorite wins) = 410.5/655 = 0.63

• Pr(favorite wins | x = 3.5) = 36/59 = 0.61

• Pr(favorite wins by more than the point spread) = 308/655 = 0.47

• Pr(favorite wins by more than the point spread | x = 3.5) = 32/59 = 0.54.

Figure 1.2 (a) Scatterplot of (actual outcome − point spread) vs. point spread for each of 672
professional football games (with uniform random jitter added to x and y coordinates). (b) Histogram
of the differences between the game outcome and the point spread, with the N(0, 14²) density
superimposed.

These empirical probability assignments all seem sensible in that they match the intu- 
ition of knowledgeable football fans. However, such probability assignments are problematic 
for events with few directly relevant data points. For example, 8.5-point favorites won five 
out of five times during this three-year period, whereas 9-point favorites won thirteen out of 
twenty times. However, we realistically expect the probability of winning to be greater for 
a 9-point favorite than for an 8.5-point favorite. The small sample size with point spread 
8.5 leads to imprecise probability assignments. We consider an alternative method using a 
parametric model. 


A parametric model for the difference between outcome and point spread 


Figure 1.2a displays the differences y − x between the observed game outcome and the point
spread, plotted versus the point spread, for the games in the football dataset. (Once again,
random jitter was added to both coordinates.) This plot suggests that it may be roughly
reasonable to model the distribution of y − x as independent of x. (See Exercise 6.10.)
Figure 1.2b is a histogram of the differences y − x for all the football games, with a fitted
normal density superimposed. This plot suggests that it may be reasonable to approximate
the marginal distribution of the random variable d = y − x by a normal distribution. The
sample mean of the 672 values of d is 0.07, and the sample standard deviation is 13.86, 
suggesting that the results of football games are approximately normal with mean equal to 
the point spread and standard deviation nearly 14 points (two converted touchdowns). For 
the remainder of the discussion we take the distribution of d to be independent of x and 
normal with mean zero and standard deviation 14 for each x; that is, 


\[ d \mid x \sim \mathrm{N}(0, 14^2), \]


as displayed in Figure 1.2b. The assigned probability model is not perfect: it does not fit 
the data exactly, and, as is often the case with real data, neither football scores nor point 
spreads are continuous-valued quantities. 


Assigning probabilities using the parametric model


Nevertheless, the model provides a convenient approximation that can be used to assign
probabilities to events. If d has a normal distribution with mean zero and is independent of
the point spread, then the probability that the favorite wins by more than the point spread
is 1/2, conditional on any value of the point spread, and therefore unconditionally as well.
Denoting probabilities obtained by the normal model as Pr_norm, the probability that an
x-point favorite wins the game can be computed, assuming the normal model, as follows:

\[ \mathrm{Pr}_{\mathrm{norm}}(y > 0 \mid x) = \mathrm{Pr}_{\mathrm{norm}}(d > -x \mid x) = 1 - \Phi\left(-\frac{x}{14}\right), \]

where Φ is the standard normal cumulative distribution function. For example,

• Pr_norm(favorite wins | x = 3.5) = 0.60
• Pr_norm(favorite wins | x = 8.5) = 0.73
• Pr_norm(favorite wins | x = 9.0) = 0.74.


The probability for a 3.5-point favorite agrees with the empirical value given earlier, whereas 
the probabilities for 8.5- and 9-point favorites make more intuitive sense than the empirical 
values based on small samples. 
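
The model-based probabilities can be reproduced directly from the formula 1 − Φ(−x/14). The sketch below uses only the Python standard library; the function names are our own, and the standard normal cdf is written in terms of the error function.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pr_favorite_wins(spread, sd=14.0):
    """Pr_norm(y > 0 | x) = 1 - Phi(-x / sd), assuming d | x ~ N(0, sd^2)."""
    return 1.0 - normal_cdf(-spread / sd)


for spread in (3.5, 8.5, 9.0):
    print(f"x = {spread}: {pr_favorite_wins(spread):.2f}")

# Empirical proportions quoted in the text for the two sparsely observed spreads.
empirical = {8.5: 5 / 5, 9.0: 13 / 20}
for spread, prop in empirical.items():
    print(f"x = {spread}: empirical {prop:.2f}, model {pr_favorite_wins(spread):.2f}")
```

Unlike the raw proportions, the model-based values increase smoothly with the point spread, because every game contributes to the single estimated standard deviation.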


1.7 Example: calibration for record linkage 


We emphasize the essentially empirical (not ‘subjective’ or ‘personal’) nature of probabilities 
with another example in which they are estimated from data. 

Record linkage refers to the use of an algorithmic technique to identify records from 
different databases that correspond to the same individual. Record-linkage techniques are 
used in a variety of settings. The work described here was formulated and first applied in 
the context of record linkage between the U.S. Census and a large-scale post-enumeration 
survey, which is the first step of an extensive matching operation conducted to evaluate 
census coverage for subgroups of the population. The goal of this first step is to declare as 
many records as possible ‘matched’ by computer without an excessive rate of error, thereby 
avoiding the cost of the resulting manual processing for all records not declared ‘matched.’ 


Existing methods for assigning scores to potential matches 


Much attention has been paid in the record-linkage literature to the problem of assigning 
‘weights’ to individual fields of information in a multivariate record and obtaining a com- 
posite ‘score,’ which we call y, that summarizes the closeness of agreement between two 
records. Here, we assume that this step is complete in the sense that these rules have been 
chosen. The next step is the assignment of candidate matched pairs, where each pair of 
records consists of the best potential match for each other from the respective databases. 
The specified weighting rules then order the candidate matched pairs. In the motivating 
problem at the Census Bureau, a binary choice is made between the alternatives ‘declare 
matched’ vs. ‘send to followup,’ where a cutoff score is needed above which records are 
declared matched. The false-match rate is then defined as the number of falsely matched 
pairs divided by the number of declared matched pairs. 

Particularly relevant for any such decision problem is an accurate method for assessing 
the probability that a candidate matched pair is a correct match as a function of its score. 
Simple methods exist for converting the scores into probabilities, but these lead to extremely 
inaccurate, typically grossly optimistic, estimates of false-match rates. For example, a 
manual check of a set of records with nominal false-match probabilities ranging from 10^-3
to 10^-7 (that is, pairs deemed almost certain to be matches) found actual false-match rates
closer to the 1% range. Records with nominal false-match probabilities of 1% had an actual
false-match rate of 5%.

Figure 1.3 Histograms of weight scores y for true and false matches in a sample of records from
the 1988 test Census. Most of the matches in the sample are true (because a pre-screening process
has already picked these as the best potential match for each case), and the two distributions are
mostly, but not completely, separated.

We would like to use Bayesian methods to recalibrate these to obtain objective proba- 
bilities of matching for a given decision rule—in the same way that in the football example, 
we used past data to estimate the probabilities of different game outcomes conditional on 
the point spread. Our approach is to work with the scores y and empirically estimate the 
probability of a match as a function of y. 


Estimating match probabilities empirically 


We obtain accurate match probabilities using mixture modeling, a topic we discuss in detail 
in Chapter 22. The distribution of previously obtained scores for the candidate matches 
is considered a ‘mixture’ of a distribution of scores for true matches and a distribution for 
non-matches. The parameters of the mixture model are estimated from the data. The 
estimated parameters allow us to calculate an estimate of the probability of a false match 
(a pair declared matched that is not a true match) for any given decision threshold on the 
scores. In the procedure that was actually used, some elements of the mixture model (for 
example, the optimal transformation required to allow a mixture of normal distributions 
to apply) were fit using ‘training’ data with known match status (separate from the data 
to which we apply our calibration procedure), but we do not describe those details here. 
Instead we focus on how the method would be used with a set of data with unknown match 
status. 

Support for this approach is provided in Figure 1.3, which displays the distribution of 
scores for the matches and non-matches in a particular dataset obtained from 2300 records 
from a ‘test Census’ survey conducted in a single local area two years before the 1990 Census. 
The two distributions, p(y|match) and p(y|non-match), are mostly distinct (meaning that
in most cases it is possible to identify a candidate as a match or not given the score alone)
but with some overlap.

Figure 1.4 Lines show expected false-match rate (and 95% bounds) as a function of the proportion
of cases declared matches, based on the mixture model for record linkage. Dots show the actual
false-match rate for the data.

In our application dataset, we do not know the match status. Thus we are faced with a 
single combined histogram from which we estimate the two component distributions and the 
proportion of the population of scores that belong to each component. Under the mixture 
model, the distribution of scores can be written as, 


p(y) = Pr(match) p(y|match) + Pr(non-match) p(y|non-match). (1.7) 


The mixture probability (Pr(match)) and the parameters of the distributions of matches 
(p(y|match)) and non-matches (p(y|non-match)) are estimated using the mixture model 
approach (as described in Chapter 22) applied to the combined histogram from the data 
with unknown match status. 

To use the method to make record-linkage decisions, we construct a curve giving the 
false-match rate as a function of the decision threshold, the score above which pairs will 
be ‘declared’ a match. For a given decision threshold, the probability distributions in (1.7) 
can be used to estimate the probability of a false match, a score y above the threshold 
originating from the distribution p(y|non-match). The lower the threshold, the more pairs 
we will declare as matches. As we declare more matches, the proportion of errors increases. 
The approach described here should provide an objective error estimate for each threshold. 
(See the validation in the next paragraph.) Then a decision maker can determine the 
threshold that provides an acceptable balance between the goals of declaring more matches 
automatically (thus reducing the clerical labor) and making fewer mistakes. 
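
To make the threshold calculation concrete, the sketch below computes the expected false-match rate among declared matches for a two-component mixture. It assumes, purely for illustration, that the scores in each component are normal on the original scale, and it uses hypothetical parameter values. In the actual procedure the parameters are estimated from data and a transformation of the scores is involved, neither of which is reproduced here.

```python
import math


def normal_sf(y, mean, sd):
    """Upper-tail probability Pr(Y > y) for Y ~ N(mean, sd^2)."""
    return 0.5 * math.erfc((y - mean) / (sd * math.sqrt(2.0)))


def false_match_rate(threshold, p_match, match, non_match):
    """Expected share of non-matches among pairs scoring above `threshold`.

    `match` and `non_match` are (mean, sd) pairs for the two mixture components.
    """
    declared_true = p_match * normal_sf(threshold, *match)
    declared_false = (1.0 - p_match) * normal_sf(threshold, *non_match)
    return declared_false / (declared_true + declared_false)


# HYPOTHETICAL parameters, chosen only to mimic two mostly separated components.
P_MATCH = 0.9
MATCH = (10.0, 3.0)      # (mean, sd) of scores for true matches
NON_MATCH = (-5.0, 3.0)  # (mean, sd) of scores for non-matches

# Lowering the threshold declares more pairs matched and raises the error rate.
for t in (8.0, 4.0, 0.0, -4.0):
    print(t, false_match_rate(t, P_MATCH, MATCH, NON_MATCH))
```

As the threshold drops toward the bottom of the score range, every pair is declared a match and the false-match rate approaches Pr(non-match).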


External validation of the probabilities using test data 


The approach described above was externally validated using data for which the match 
status is known. The method was applied to data from three different locations of the 1988 
test Census, and so three tests of the methods were possible. We provide detailed results 
for one; results for the other two were similar. The mixture model was fitted to the scores 
of all the candidate pairs at a test site. Then the estimated model was used to create the 
lines in Figure 1.4, which show the expected false-match rate (and uncertainty bounds) in
terms of the proportion of cases declared matched, as the threshold varies from high (thus
allowing no matches) to low (thus declaring almost all the candidate pairs to be matches).
The false-match proportion is an increasing function of the number of declared matches,
which makes sense: as we move rightward on the graph, we are declaring weaker and weaker
cases to be matches.

Figure 1.5 Expansion of Figure 1.4 in the region where the estimated and actual match rates change
rapidly. In this case, it would seem a good idea to match about 88% of the cases and send the rest
to followup.

The lines on Figure 1.4 display the expected proportion of false matches and 95% pos- 
terior bounds for the false-match rate as estimated from the model. (These bounds give 
the estimated range within which there is 95% posterior probability that the false-match 
rate lies. The concept of posterior intervals is discussed in more detail in the next chapter.) 
The dots in the graph display the actual false-match proportions, which track well with the 
model. In particular, the model would suggest a recommendation of declaring something 
less than 90% of cases as matched and giving up on the other 10% or so, so as to avoid 
most of the false matches, and the dots show a similar pattern. 

It is clearly possible to match large proportions of the files with little or no error. Also, 
the quality of candidate matches becomes dramatically worse at some point where the 
false-match rate accelerates. Figure 1.5 takes a magnifying glass to the previous display 
to highlight the behavior of the calibration procedure in the region of interest where the 
false-match rate accelerates. The predicted false-match rate curves bend upward, close to 
the points where the observed false-match rate curves rise steeply, which is a particularly 
encouraging feature of the calibration method. The calibration procedure performs well 
from the standpoint of providing predicted probabilities that are close to the true probabili- 
ties and interval estimates that are informative and include the true values. By comparison, 
the original estimates of match probabilities, constructed by multiplying weights without 
empirical calibration, were highly inaccurate. 


1.8 Some useful results from probability theory 


We assume the reader is familiar with elementary manipulations involving probabilities 
and probability distributions. In particular, basic probability background that must be 
well understood for key parts of the book includes the manipulation of joint densities, the 
definition of simple moments, the transformation of variables, and methods of simulation. In 
this section we briefly review these assumed prerequisites and clarify some further notational
conventions used in the remainder of the book. Appendix A provides information on some 
commonly used probability distributions. 

As introduced in Section 1.3, we generally represent joint distributions by their joint 
probability mass or density function, with dummy arguments reflecting the name given to 
each variable being considered. Thus for two quantities u and v, we write the joint density 
as p(u,v); if specific values need to be referenced, this notation will be further abused as 
with, for example, p(u, v=1). 

In Bayesian calculations relating to a joint density p(u,v), we will often refer to a
conditional distribution or density function such as p(u|v) and a marginal density such as
p(u) = ∫ p(u,v) dv. In this notation, either or both u and v can be vectors. Typically
it will be clear from the context that the range of integration in the latter expression
refers to the entire range of the variable being integrated out. It is also often useful to
factor a joint density as a product of marginal and conditional densities; for example,
p(u, v, w) = p(u|v, w) p(v|w) p(w).
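
These manipulations are easy to check numerically in the discrete case, where the integral becomes a sum. The sketch below uses a small, entirely hypothetical joint distribution and exact fractions; the function names are ours.

```python
from fractions import Fraction

# A small HYPOTHETICAL joint distribution p(u, v), with u in {0, 1} and v in {0, 1, 2}.
joint = {
    (0, 0): Fraction(1, 10), (0, 1): Fraction(2, 10), (0, 2): Fraction(1, 10),
    (1, 0): Fraction(3, 10), (1, 1): Fraction(1, 10), (1, 2): Fraction(2, 10),
}


def marginal_u(joint):
    """p(u) = sum over v of p(u, v): the discrete analogue of integrating out v."""
    out = {}
    for (u, _v), p in joint.items():
        out[u] = out.get(u, 0) + p
    return out


def marginal_v(joint):
    """p(v) = sum over u of p(u, v)."""
    out = {}
    for (_u, v), p in joint.items():
        out[v] = out.get(v, 0) + p
    return out


def conditional_u_given_v(joint):
    """p(u | v) = p(u, v) / p(v), stored under the key (u, v)."""
    p_v = marginal_v(joint)
    return {(u, v): p / p_v[v] for (u, v), p in joint.items()}


p_u = marginal_u(joint)
p_v = marginal_v(joint)
p_u_given_v = conditional_u_given_v(joint)

# The factorization p(u, v) = p(u | v) p(v) holds for every cell.
print(all(joint[k] == p_u_given_v[k] * p_v[k[1]] for k in joint))
```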

Some authors use different notations for distributions on parameters and observables
(for example, π(θ), f(y|θ)), but this obscures the fact that all probability distributions have
the same logical status in Bayesian inference. We must always be careful, though, to
indicate appropriate conditioning; for example, p(y|θ) is different from p(y). In the
interests of conciseness, however, our notation hides the conditioning on hypotheses that hold
throughout (no probability judgments can be made in a vacuum), and to be more explicit
one might use a notation such as the following:


\[ p(\theta, y \mid H) = p(\theta \mid H)\, p(y \mid \theta, H), \]


where H refers to the set of hypotheses or assumptions used to define the model. Also, we 
sometimes suppress explicit conditioning on known explanatory variables, x. 
We use the standard notations, E(·) and var(·), for mean and variance, respectively:


\[ \mathrm{E}(u) = \int u\, p(u)\, du, \qquad \mathrm{var}(u) = \int (u - \mathrm{E}(u))^2\, p(u)\, du. \]


For a vector parameter u, the expression for the mean is the same, and the covariance 
matrix is defined as 


\[ \mathrm{var}(u) = \int (u - \mathrm{E}(u))(u - \mathrm{E}(u))^T\, p(u)\, du, \]


where u is considered a column vector. (We use the terms ‘variance matrix’ and ‘covariance 
matrix’ interchangeably.) This notation is slightly imprecise, because E(u) and var(u) are 
really functions of the distribution function, p(u), not of the variable u. In an expression 
involving an expectation, any variable that does not appear explicitly as a conditioning 
variable is assumed to be integrated out in the expectation; for example, E(u|v) refers to
the conditional expectation of u with v held fixed—that is, the conditional expectation as 
a function of v—whereas E(u) is the expectation of u, averaging over v (as well as u). 


Modeling using conditional probability 


Useful probability models often express the distribution of observables conditionally or hier- 
archically rather than through more complicated unconditional distributions. For example, 
suppose y is the height of a university student selected at random. The marginal distri- 
bution p(y) is (essentially) a mixture of two approximately normal distributions centered 
around 160 and 175 centimeters. A more useful description of the distribution of y would 
be based on the joint distribution of height and sex: p(male) ≈ p(female) ≈ 1/2, along with
the conditional specifications that p(y|female) and p(y|male) are each approximately normal
with means 160 and 175 cm, respectively. If the conditional variances are not too large,
the marginal distribution of y is bimodal. In general, we prefer to model complexity with 
a hierarchical structure using additional variables rather than with complicated marginal 
distributions, even when the additional variables are unobserved or even unobservable; this 
theme underlies mixture models, as discussed in Chapter 22. We repeatedly return to the 
theme of conditional modeling throughout the book. 
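
The height example can be mimicked by simulation. The sketch below is only illustrative: it draws sex first and then height given sex, using the means quoted in the text, while the common standard deviation of 5 cm, the seed, and the function name are assumptions made here for the example.

```python
import random

def simulate_heights(n, sd=5.0, seed=1):
    """Draw n heights (cm) hierarchically: first sex, then height given sex.

    The means 160 and 175 cm come from the text; sd is an assumed value.
    """
    rng = random.Random(seed)
    heights = []
    for _ in range(n):
        sex = "female" if rng.random() < 0.5 else "male"
        mean = 160.0 if sex == "female" else 175.0
        heights.append(rng.gauss(mean, sd))
    return heights

heights = simulate_heights(10000)

# Crude text histogram of the marginal distribution p(y) in 5 cm bins
for lo in range(140, 200, 5):
    count = sum(lo <= h < lo + 5 for h in heights)
    print(f"{lo}-{lo + 5}: {'#' * (count // 50)}")
```

The marginal draws are produced without ever writing down the mixture density itself, which is the point of specifying the model conditionally.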


Means and variances of conditional distributions 


It is often useful to express the mean and variance of a random variable u in terms of 
the conditional mean and variance given some related quantity v. The mean of u can be 
obtained by averaging the conditional mean over the marginal distribution of v, 


E(u) = E(E(u|v)), (1.8) 


where the inner expectation averages over u, conditional on v, and the outer expectation 
averages over v. Identity (1.8) is easy to derive by writing the expectation in terms of the 
joint distribution of u and v and then factoring the joint distribution: 


$$E(u) = \iint u\, p(u,v)\, du\, dv = \iint u\, p(u \mid v)\, du\; p(v)\, dv = \int E(u \mid v)\, p(v)\, dv.$$


The corresponding result for the variance includes two terms, the mean of the conditional 
variance and the variance of the conditional mean: 


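$$\mathrm{var}(u) = E(\mathrm{var}(u \mid v)) + \mathrm{var}(E(u \mid v)). \tag{1.9}$$

This result can be derived by expanding the terms on the right side of (1.9):

$$\begin{aligned}
E(\mathrm{var}(u \mid v)) + \mathrm{var}(E(u \mid v))
&= E\left(E(u^2 \mid v) - (E(u \mid v))^2\right) + E\left((E(u \mid v))^2\right) - \left(E(E(u \mid v))\right)^2 \\
&= E(u^2) - E\left((E(u \mid v))^2\right) + E\left((E(u \mid v))^2\right) - (E(u))^2 \\
&= E(u^2) - (E(u))^2 \\
&= \mathrm{var}(u).
\end{aligned}$$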

Identities (1.8) and (1.9) also hold if u is a vector, in which case E(u) is a vector and var(u) 
a matrix. 
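
As a numerical check of identities (1.8) and (1.9), the following sketch computes the right-hand sides exactly for a simple two-component model and compares them with the mean and variance of joint simulation draws. The model (v a fair coin, u given v normal with mean 160 or 175 and standard deviation 5) and all variable names are assumptions chosen for illustration.

```python
import random

# Assumed setup (illustrative): v is a fair coin, and u | v is normal with
# mean 160 or 175 and a common standard deviation of 5.
means = {0: 160.0, 1: 175.0}
sd = 5.0

# Right-hand sides of (1.8) and (1.9), computed exactly
mean_of_cond_mean = 0.5 * means[0] + 0.5 * means[1]        # E(E(u|v))
mean_of_cond_var = sd ** 2                                  # E(var(u|v))
var_of_cond_mean = (0.5 * (means[0] - mean_of_cond_mean) ** 2
                    + 0.5 * (means[1] - mean_of_cond_mean) ** 2)  # var(E(u|v))
total_var = mean_of_cond_var + var_of_cond_mean

# Left-hand sides, estimated from joint simulation draws of (v, u)
rng = random.Random(0)
draws = []
for _ in range(100000):
    v = rng.randrange(2)
    draws.append(rng.gauss(means[v], sd))

sim_mean = sum(draws) / len(draws)
sim_var = sum((d - sim_mean) ** 2 for d in draws) / len(draws)

print(mean_of_cond_mean, sim_mean)
print(total_var, sim_var)
```

The two numbers on each printed line should agree up to simulation error, which shrinks as the number of draws grows.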


Transformation of variables 


It is common to transform a probability distribution from one parameterization to another. 
We review the basic result here for a probability density on a transformed space. For 
clarity, we use subscripts here instead of our usual generic notation, p(·). Suppose p_u(u) is
the density of the vector u, and we transform to v = f(u), where v has the same number 
of components as u. 

If p_u is a discrete distribution, and f is a one-to-one function, then the density of v is
given by

$$p_v(v) = p_u(f^{-1}(v)).$$

If f is a many-to-one function, then a sum of terms appears on the right side of this
expression for p_v(v), with one term corresponding to each of the branches of the inverse
function.

If p_u is a continuous distribution, and v = f(u) is a one-to-one transformation, then the
joint density of the transformed vector is

$$p_v(v) = |J|\, p_u(f^{-1}(v)),$$

where |J| is the absolute value of the determinant of the Jacobian of the transformation
u = f^{-1}(v) as a function of v; the Jacobian J is the square matrix of partial derivatives
(with dimension given by the number of components of u), with the (i, j)th entry equal to
∂u_i/∂v_j. Once again, if f is many-to-one, then p_v(v) is a sum or integral of terms.
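
A one-dimensional instance of the continuous change-of-variables formula can be checked numerically. In the sketch below, u is assumed to have an exponential distribution with rate 1 and v = log(u); both choices, and the function names, are made up for illustration. The probability of an interval for v is computed once by integrating the transformed density and once by transforming simulated draws of u.

```python
import math
import random

def p_u(u):
    """Density of u: exponential with rate 1 (an assumed example)."""
    return math.exp(-u) if u > 0 else 0.0

def p_v(v):
    """Density of v = log(u), from p_v(v) = |J| p_u(f^{-1}(v)).

    Here f^{-1}(v) = exp(v), so the Jacobian du/dv is exp(v).
    """
    jacobian = math.exp(v)
    return abs(jacobian) * p_u(math.exp(v))

# Probability that v falls in [a, b], by midpoint-rule integration of p_v
a, b, steps = -1.0, 1.0, 2000
width = (b - a) / steps
by_integration = sum(p_v(a + (k + 0.5) * width) for k in range(steps)) * width

# The same probability from simulated draws of u, transformed to v
rng = random.Random(0)
v_draws = [math.log(rng.expovariate(1.0)) for _ in range(100000)]
by_simulation = sum(a <= v <= b for v in v_draws) / len(v_draws)

print(by_integration, by_simulation)
```

Omitting the Jacobian factor in `p_v` would make the two numbers disagree, which is a quick way to catch a forgotten |J| in practice.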

In one dimension, we commonly use the logarithm to transform the parameter space
from (0, ∞) to (−∞, ∞). When working with parameters defined on the open unit interval,
(0,1), we often use the logistic transformation: 


$$\mathrm{logit}(u) = \log\left(\frac{u}{1-u}\right), \tag{1.10}$$


whose inverse transformation is 


$$\mathrm{logit}^{-1}(v) = \frac{e^v}{1 + e^v}.$$

Another common choice is the probit transformation, Φ^{-1}(u), where Φ is the standard
normal cumulative distribution function, to transform from (0,1) to (−∞, ∞).
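
These transformations are short to code. The following is a minimal sketch using only the Python standard library; the function names are choices made here, and the two-branch form of the inverse logit is an implementation detail (not from the text) that avoids overflow for large |v|.

```python
import math
from statistics import NormalDist

def logit(u):
    """Map u in (0, 1) to the real line, as in (1.10)."""
    return math.log(u / (1.0 - u))

def inv_logit(v):
    """Inverse of logit; maps the real line back to (0, 1)."""
    # Algebraically equal to exp(v) / (1 + exp(v)); the branch for v >= 0
    # avoids overflow in exp(v) for large positive v.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    return math.exp(v) / (1.0 + math.exp(v))

def probit(u):
    """Inverse of the standard normal cumulative distribution function."""
    return NormalDist().inv_cdf(u)

for u in (0.1, 0.5, 0.9):
    print(u, logit(u), inv_logit(logit(u)), probit(u))
```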


1.9 Computation and software 


At the time of writing, the authors rely primarily on the software package R for graphs and 
basic simulations, fitting of classical simple models (including regression, generalized linear 
models, and nonparametric methods such as locally weighted regression), optimization, and 
some simple programming. We use the Bayesian inference package Stan (see Appendix C) 
for fitting most models, but for teaching purposes in this book we describe how to perform 
most of the computations from first principles. Even when using Stan, we typically work 
within R to plot and transform the data before model fitting, and to display inferences and 
model checks afterwards. 
Specific computational tasks that arise in Bayesian data analysis include: 


• Vector and matrix manipulations (see Table 1.1)

• Computing probability density functions (see Appendix A)

• Drawing simulations from probability distributions (see Appendix A for standard
  distributions and Exercise 1.9 for an example of a simple stochastic process)

• Structured programming (including looping and customized functions)

• Calculating the linear regression estimate and variance matrix (see Chapter 14)

• Graphics, including scatterplots with overlain lines and multiple graphs per page (see
  Chapter 6 for examples).


Our general approach to computation is to fit many models, gradually increasing the 
complexity. We do not recommend the strategy of writing a model and then letting the 
computer run overnight to estimate it perfectly. Rather, we prefer to fit each model rela- 
tively quickly, using inferences from the previously fitted simpler models as starting values, 
and displaying inferences and comparing to data before continuing. 

We discuss computation in detail in Part III of this book after first introducing the 
fundamental concepts of Bayesian modeling, inference, and model checking. Appendix C 
illustrates how to perform computations in R and Stan in several different ways for a single 
example. 


Summarizing inferences by simulation 


Simulation forms a central part of much applied Bayesian analysis, because of the relative 
ease with which samples can often be generated from a probability distribution, even when
the density function cannot be explicitly integrated. In performing simulations, it is helpful
to consider the duality between a probability density function and a histogram of a set of 
random draws from the distribution: given a large enough sample, the histogram can pro- 
vide practically complete information about the density, and in particular, various sample 
moments, percentiles, and other summary statistics provide estimates of any aspect of the 
distribution, to a level of precision that can be estimated. For example, to estimate the 
95th percentile of the distribution of θ, draw a random sample of size S from p(θ) and use
the 0.95Sth order statistic. For most purposes, S = 1000 is adequate for estimating the 
95th percentile in this way. 
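
The order-statistic estimate can be sketched as follows. Because no particular p(θ) is specified at this point, a standard normal distribution is used as a stand-in (an assumption made only so that the estimate can be compared with the known 95th percentile, about 1.645); the function name and seed are also illustrative.

```python
import random

def percentile_by_simulation(draws, prob):
    """Estimate a percentile by the (prob * S)th order statistic of S draws."""
    ordered = sorted(draws)
    rank = max(1, round(prob * len(ordered)))   # 1-based rank; 950 when S = 1000
    return ordered[rank - 1]

rng = random.Random(2)
S = 1000
# Stand-in for p(theta): a standard normal, whose 95th percentile is known
draws = [rng.gauss(0.0, 1.0) for _ in range(S)]
estimate = percentile_by_simulation(draws, 0.95)
print(estimate)
```

Rerunning with different seeds gives a feel for the simulation variability of the estimate at S = 1000.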

Another advantage of simulation is that extremely large or small simulated values often 
flag a problem with model specification or parameterization (for example, see Figure 4.2) 
that might not be noticed if estimates and probability statements were obtained in analytic 
form. 

Generating values from a probability distribution is often straightforward with modern 
computing techniques based on (pseudo)random number sequences. A well-designed pseu- 
dorandom number generator yields a deterministic sequence that appears to have the same 
properties as a sequence of independent random draws from the uniform distribution on 
[0,1]. Appendix A describes methods for drawing random samples from some commonly 
used distributions. 


Sampling using the inverse cumulative distribution function 


As an introduction to the ideas of simulation, we describe a method for sampling from 
discrete and continuous distributions using the inverse cumulative distribution function. 
The cumulative distribution function, or cdf, F, of a one-dimensional distribution, p(v), is 
defined by 


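$$F(v_*) = \Pr(v \le v_*) = \begin{cases} \sum_{v \le v_*} p(v) & \text{if } p \text{ is discrete} \\ \int_{-\infty}^{v_*} p(v)\, dv & \text{if } p \text{ is continuous.} \end{cases}$$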


The inverse cdf can be used to obtain random samples from the distribution p, as 
follows. First draw a random value, U, from the uniform distribution on [0,1], using a table 
of random numbers or, more likely, a random number function on the computer. Now let 
v = F^{-1}(U). The function F is not necessarily one-to-one (certainly not if the distribution
is discrete), but F^{-1}(U) is unique with probability 1. The value v will be a random draw
from p, and is easy to compute as long as F^{-1}(U) is simple. For a discrete distribution,
F^{-1} can simply be tabulated.

For a continuous example, suppose v has an exponential distribution with parameter λ
(see Appendix A); then its cdf is F(v) = 1 − e^{−λv}, and the value of v for which U = F(v) is
v = −log(1 − U)/λ. Then, recognizing that 1 − U also has the uniform distribution on [0, 1], we
see we can obtain random draws from the exponential distribution as −log(U)/λ. We discuss
other methods of simulation in Part III of the book and Appendix A.
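
Both uses of the inverse cdf can be sketched in a few lines. The code below is illustrative rather than a recommended sampler: the function names, the rate of 2, and the three-point discrete distribution are assumptions made for the example.

```python
import math
import random
from bisect import bisect_left
from itertools import accumulate

def draw_exponential(rate, rng):
    """Inverse-cdf draw from the exponential distribution: -log(U) / rate."""
    u = 1.0 - rng.random()      # rng.random() lies in [0, 1), so u lies in (0, 1]
    return -math.log(u) / rate

def draw_discrete(values, probs, rng):
    """Inverse-cdf draw from a discrete distribution using a tabulated cdf."""
    cdf = list(accumulate(probs))               # tabulated F at each value
    u = rng.random()
    # Smallest index whose cumulative probability is at least u; the min()
    # guards against the last cdf entry falling just below 1 from rounding.
    index = min(bisect_left(cdf, u), len(values) - 1)
    return values[index]

rng = random.Random(3)
exp_draws = [draw_exponential(2.0, rng) for _ in range(50000)]
exp_mean = sum(exp_draws) / len(exp_draws)
print(exp_mean)      # should be near 1 / rate = 0.5

disc_draws = [draw_discrete([0, 1, 2], [0.2, 0.5, 0.3], rng) for _ in range(50000)]
freq_one = disc_draws.count(1) / len(disc_draws)
print(freq_one)      # should be near 0.5
```

In practice one would tabulate the discrete cdf once rather than on every call; it is recomputed here only to keep the function self-contained.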


Simulation of posterior and posterior predictive quantities 


In practice, we are most often interested in simulating draws from the posterior
distribution of the model parameters θ, and perhaps from the posterior predictive distribution of
unknown observables ỹ. Results from a set of S simulation draws can be stored in the
computer in an array, as illustrated in Table 1.1. We use the notation s = 1, ..., S to
index simulation draws; (θ^s, ỹ^s) is the corresponding joint draw of parameters and predicted
quantities from their joint posterior distribution.

Table 1.1 Structure of posterior and posterior predictive simulations. The superscripts are indexes,
not powers. 


From these simulated values, we can estimate the posterior distribution of any quantity
of interest, such as θ_1/θ_3, by just computing a new column in Table 1.1 using the existing S
draws of (θ, ỹ). We can estimate the posterior probability of any event, such as Pr(ỹ_1 + ỹ_2 >
e^{θ_1}), by the proportion of the S simulations for which it is true. We are often interested in
posterior intervals; for example, the central 95% posterior interval [a,b] for the parameter
θ_j, for which Pr(θ_j < a) = 0.025 and Pr(θ_j > b) = 0.025. These values can be directly
estimated by the appropriate simulated values of θ_j, for example, the 25th and 976th order
statistics if S=1000. We commonly summarize inferences by 50% and 95% intervals.
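
The bookkeeping described here can be sketched with a list of rows, one per simulation draw. The draws below are placeholders generated from made-up normal distributions, not output from any fitted model; the column names and the particular event are likewise assumptions for illustration.

```python
import math
import random

rng = random.Random(4)
S = 1000

# Placeholder draws standing in for the rows of a simulation table. In a real
# analysis these would come from the fitted model; here they are made up.
rows = []
for _ in range(S):
    theta1 = rng.gauss(1.0, 0.2)
    theta3 = rng.gauss(2.0, 0.2)
    y1 = rng.gauss(theta1, 1.0)     # predictive draws conditional on theta1
    y2 = rng.gauss(theta1, 1.0)
    rows.append({"theta1": theta1, "theta3": theta3, "y1": y1, "y2": y2})

# A derived quantity is just a new column computed row by row
ratio = [r["theta1"] / r["theta3"] for r in rows]

# Posterior probability of an event: the proportion of draws where it holds
prob_event = sum(r["y1"] + r["y2"] > math.exp(r["theta1"]) for r in rows) / S

# Central 95% interval: the 25th and 976th order statistics when S = 1000
ordered = sorted(r["theta1"] for r in rows)
interval = (ordered[25 - 1], ordered[976 - 1])

print(sum(ratio) / S, prob_event, interval)
```

Because each row is a joint draw, any function of the columns computed row by row is itself a draw from the posterior distribution of that function.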

We return to the accuracy of simulation inferences in Section 10.5 after we have gained 
some experience using simulations of posterior distributions in some simple examples. 


1.10 Bayesian inference in applied statistics 


A pragmatic rationale for the use of Bayesian methods is the inherent flexibility introduced 
by their incorporation of multiple levels of randomness and the resultant ability to combine 
information from different sources, while incorporating all reasonable sources of uncertainty 
in inferential summaries. Such methods naturally lead to smoothed estimates in complicated 
data structures and consequently have the ability to obtain better real-world answers. 

Another reason for focusing on Bayesian methods is more psychological, and involves the 
relationship between the statistician and the client or specialist in the subject matter area 
who is the consumer of the statistician’s work. In many practical cases, clients will interpret 
interval estimates provided by statisticians as Bayesian intervals, that is, as probability 
statements about the likely values of unknown quantities conditional on the evidence in 
the data. Such direct probability statements require prior probability specifications for 
unknown quantities (or more generally, probability models for vectors of unknowns), and 
thus the kinds of answers clients will assume are being provided by statisticians, Bayesian 
answers, require full probability models—explicit or implicit. 

Finally, Bayesian inferences are conditional on probability models that invariably contain 
approximations in their attempt to represent complicated real-world relationships. If the 
Bayesian answers vary dramatically over a range of scientifically reasonable assumptions 
that are unassailable by the data, then the resultant range of possible conclusions must be 
entertained as legitimate, and we believe that the statistician has the responsibility to make 
the client aware of this fact. 

In this book, we focus on the construction of models (especially hierarchical ones, as 
discussed in Chapter 5 onward) to relate complicated data structures to scientific questions, 
checking the fit of such models, and investigating the sensitivity of conclusions to reasonable 
modeling assumptions. From this point of view, the strength of the Bayesian approach lies in 
(1) its ability to combine information from multiple sources (thereby in fact allowing greater 
‘objectivity’ in final conclusions), and (2) its more encompassing accounting of uncertainty 
about the unknowns in a statistical problem. 




Other important themes, many of which are common to much modern applied statistical 
practice, whether formally Bayesian or not, are the following: 


• a willingness to use many parameters

• hierarchical structuring of models, which is the essential tool for achieving partial
  pooling of estimates and compromising in a scientific way between alternative sources of
  information

• model checking: not only by examining the internal goodness of fit of models to
  observed and possible future data, but also by comparing inferences about estimands and
  predictions of interest to substantive knowledge

• an emphasis on inference in the form of distributions or at least interval estimates rather
  than simple point estimates

• the use of simulation as the primary method of computation; the modern computational
  counterpart to a ‘joint probability distribution’ is a set of randomly drawn values, and a
  key tool for dealing with missing data is the method of multiple imputation (computation
  and multiple imputation are discussed in more detail in later chapters)

• the use of probability models as tools for understanding and possibly improving
  data-analytic techniques that may not explicitly invoke a Bayesian model

• the importance of including in the analysis as much background information as possible,
  so as to approximate the goal that data can be viewed as a random sample, conditional
  on all the variables in the model

• the importance of designing studies to have the property that inferences for estimands
  of interest will be robust to model assumptions.


1.11 Bibliographic note 


Several good introductory books have been written on Bayesian statistics, beginning with 
Lindley (1965), and continuing through Hoff (2009). Berry (1996) presents, from a Bayesian 
perspective, many of the standard topics for an introductory statistics textbook. Gill 
(2002) and Jackman (2009) introduce applied Bayesian statistics for social scientists, Kruschke
(2011) introduces Bayesian methods for psychology researchers, and Christensen et
al. (2010) supply a general introduction. Carlin and Louis (2008) cover the theory and 
applications of Bayesian inference, focusing on biological applications and connections to 
classical methods. Some resources for teaching Bayesian statistics include Sedlmeier and 
Gigerenzer (2001) and Gelman (1998, 2008b). 

The bibliographic notes at the ends of the chapters in this book refer to a variety of 
specific applications of Bayesian data analysis. Several review articles in the statistical 
literature, such as Breslow (1990) and Racine et al. (1986), have appeared that discuss, 
in general terms, areas of application in which Bayesian methods have been useful. The 
volumes edited by Gatsonis et al. (1993-2002) are collections of Bayesian analyses, including 
extensive discussions about choices in the modeling process and the relations between the 
statistical methods and the applications. 

The foundations of probability and Bayesian statistics are an important topic that we 
treat only briefly. Bernardo and Smith (1994) give a thorough review of the foundations 
of Bayesian models and inference with a comprehensive list of references. Jeffreys (1961) is 
a self-contained book about Bayesian statistics that comprehensively presents an inductive 
view of inference; Good (1950) is another important early work. Jaynes (1983) is a collection 
of reprinted articles that present a deductive view of Bayesian inference that we believe is 
similar to ours. Both Jeffreys and Jaynes focus on applications in the physical sciences. 
Jaynes (2003) focuses on connections between statistical inference and the philosophy of 
science and includes several examples of physical probability.

Gigerenzer and Hoffrage (1995) discuss the connections between Bayesian probability
and frequency probabilities from a perspective similar to ours, and provide evidence that 
people can typically understand and compute best with probabilities that are expressed 
in the form of relative frequency. Gelman (1998) presents some classroom activities for 
teaching Bayesian ideas. 

De Finetti (1974) is an influential work that focuses on the crucial role of exchange- 
ability. More approachable discussions of the role of exchangeability in Bayesian inference 
are provided by Lindley and Novick (1981) and Rubin (1978a, 1987a). The non-Bayesian 
article by Draper et al. (1993) makes an interesting attempt to explain how exchangeable 
probability models can be justified in data analysis. Berger and Wolpert (1984) give a 
comprehensive discussion and review of the likelihood principle, and Berger (1985, Sections 
1.6, 4.1, and 4.12) reviews a range of philosophical issues from the perspective of Bayesian 
decision theory. 

Our own philosophy of Bayesian statistics appears in Gelman (2011) and Gelman and 
Shalizi (2013); for some contrasting views, see the discussion of that article, along with 
Efron (1986) and the discussions following Gelman (2008a). 

Pratt (1965) and Rubin (1984) discuss the relevance of Bayesian methods for applied 
statistics and make many connections between Bayesian and non-Bayesian approaches to 
inference. Further references on the foundations of statistical inference appear in Shafer 
(1982) and the accompanying discussion. Kahneman and Tversky (1972) and Alpert and 
Raiffa (1982) present the results of psychological experiments that assess the meaning of 
‘subjective probability’ as measured by people’s stated beliefs and observed actions. Lindley 
(1971a) surveys many different statistical ideas, all from the Bayesian perspective. Box and 
Tiao (1973) is an early book on applied Bayesian methods. They give an extensive treatment 
of inference based on normal distributions, and their first chapter, a broad introduction to 
Bayesian inference, provides a good counterpart to Chapters 1 and 2 of this book. 

The iterative process involving modeling, inference, and model checking that we present 
in Section 1.1 is discussed at length in the first chapter of Box and Tiao (1973) and also 
in Box (1980). Cox and Snell (1981) provide a more introductory treatment of these ideas 
from a less model-based perspective. 

Many good books on the mathematical aspects of probability theory are available, such 
as Feller (1968) and Ross (1983); these are useful when constructing probability models 
and working with them. O’Hagan (1988) has written an interesting introductory text on 
probability from an explicitly Bayesian point of view. 

Physical probability models for coin tossing are discussed by Keller (1986), Jaynes 
(2003), and Gelman and Nolan (2002b). The football example of Section 1.6 is discussed 
in more detail in Stern (1991); see also Harville (1980) and Glickman (1993) and Glickman 
and Stern (1998) for analyses of football scores not using the point spread. Related analyses 
of sports scores and betting odds appear in Stern (1997, 1998). For more background on 
sports betting, see Snyder (1975) and Rombola (1984). 

An interesting real-world example of probability assignment arose with the explosion 
of the Challenger space shuttle in 1986; Martz and Zimmer (1992), Dalal, Fowlkes, and 
Hoadley (1989), and Lavine (1991) present and compare various methods for assigning 
probabilities for space shuttle failures. (At the time of writing we are not aware of similar 
contributions relating to the more recent space accident in 2003.) The record-linkage ex- 
ample in Section 1.7 appears in Belin and Rubin (1995b), who discuss the mixture models 
and calibration techniques in more detail. The Census problem that motivated the record 
linkage is described by Hogan (1992). 

In all our examples, probabilities are assigned using statistical modeling and estimation, 
not by ‘subjective’ assessment. Dawid (1986) provides a general discussion of probability 
assignment, and Dawid (1982) discusses the connections between calibration and Bayesian 
probability assignment.

The graphical method of jittering, used in Figures 1.1 and 1.2 and elsewhere in this
book, is discussed in Chambers et al. (1983). For information on the statistical packages R 
and Bugs, see Becker, Chambers, and Wilks (1988), R Project (2002), Fox (2002), Venables 
and Ripley (2002), and Spiegelhalter et al. (1994, 2003). 

Norvig (2007) describes the principles and details of the Bayesian spelling corrector. 


1.12 Exercises 


1. Conditional probability: suppose that if θ = 1, then y has a normal distribution with
mean 1 and standard deviation σ, and if θ = 2, then y has a normal distribution with
mean 2 and standard deviation σ. Also, suppose Pr(θ = 1) = 0.5 and Pr(θ = 2) = 0.5.


(a) For σ = 2, write the formula for the marginal probability density for y and sketch it.

(b) What is Pr(θ = 1|y = 1), again supposing σ = 2?

(c) Describe how the posterior density of θ changes in shape as σ is increased and as it is
decreased.


2. Conditional means and variances: show that (1.8) and (1.9) hold if u is a vector. 


3. Probability calculation for genetics (from Lindley, 1965): suppose that in each individual 
of a large population there is a pair of genes, each of which can be either x or X, that 
controls eye color: those with xx have blue eyes, while heterozygotes (those with Xx or 
xX) and those with XX have brown eyes. The proportion of blue-eyed individuals is p²
and of heterozygotes is 2p(1 − p), where 0 < p < 1. Each parent transmits one of its
own genes to the child; if a parent is a heterozygote, the probability that it transmits the
gene of type X is 1/2. Assuming random mating, show that among brown-eyed children
of brown-eyed parents, the expected proportion of heterozygotes is 2p/(1 + 2p). Suppose 
Judy, a brown-eyed child of brown-eyed parents, marries a heterozygote, and they have 
n children, all brown-eyed. Find the posterior probability that Judy is a heterozygote 
and the probability that her first grandchild has blue eyes. 

4. Probability assignment: we will use the football dataset to estimate some conditional 
probabilities about professional football games. There were twelve games with point 
spreads of 8 points; the outcomes in those games were: −7, −5, −3, −3, 1, 6, 7, 13, 15,
16, 20, and 21, with positive values indicating wins by the favorite and negative values 
indicating wins by the underdog. Consider the following conditional probabilities: 


Pr(favorite wins | point spread = 8), 
Pr(favorite wins by at least 8| point spread = 8), 
Pr(favorite wins by at least 8| point spread = 8 and favorite wins). 


(a) Estimate each of these using the relative frequencies of games with a point spread of 
8. 

(b) Estimate each using the normal approximation for the distribution of (outcome −
point spread).

5. Probability assignment: the 435 U.S. Congressmembers are elected to two-year terms; 
the number of voters in an individual congressional election varies from about 50,000 to 
350,000. We will use various sources of information to estimate roughly the probability 
that at least one congressional election is tied in the next national election. 

(a) Use any knowledge you have about U.S. politics. Specify clearly what information you 
are using to construct this conditional probability, even if your answer is just a guess. 

(b) Use the following information: in the period 1900-1992, there were 20,597 congres- 
sional elections, out of which 6 were decided by fewer than 10 votes and 49 decided 
by fewer than 100 votes.

See Gelman, King, and Boscardin (1998), Mulligan and Hunter (2001), and Gelman,
Katz, and Tuerlinckx (2002) for more on this topic. 


6. Conditional probability: approximately 1/125 of all births are fraternal twins and 1/300 
of births are identical twins. Elvis Presley had a twin brother (who died at birth). What 
is the probability that Elvis was an identical twin? (You may approximate the probability 
of a boy or girl birth as 1/2.)

7. Conditional probability: the following problem is loosely based on the television game 
show Let’s Make a Deal. At the end of the show, a contestant is asked to choose one of 
three large boxes, where one box contains a fabulous prize and the other two boxes contain 
lesser prizes. After the contestant chooses a box, Monty Hall, the host of the show, 
opens one of the two boxes containing smaller prizes. (In order to keep the conclusion 
suspenseful, Monty does not open the box selected by the contestant.) Monty offers the 
contestant the opportunity to switch from the chosen box to the remaining unopened box. 
Should the contestant switch or stay with the original choice? Calculate the probability 
that the contestant wins under each strategy. This is an exercise in being clear about the 
information that should be conditioned on when constructing a probability judgment. 
See Selvin (1975) and Morgan et al. (1991) for further discussion of this problem. 


8. Subjective probability: discuss the following statement. ‘The probability of event E is 
considered “subjective” if two rational persons A and B can assign unequal probabilities 
to E, P_A(E) and P_B(E). These probabilities can also be interpreted as “conditional”:
P_A(E) = P(E|I_A) and P_B(E) = P(E|I_B), where I_A and I_B represent the knowledge
available to persons A and B, respectively.’ Apply this idea to the following examples. 

(a) The probability that a ‘6’ appears when a fair die is rolled, where A observes the 
outcome of the die roll and B does not. 

(b) The probability that Brazil wins the next World Cup, where A is ignorant of soccer 
and B is a knowledgeable sports fan. 


9. Simulation of a queuing problem: a clinic has three doctors. Patients come into the 
clinic at random, starting at 9 a.m., according to a Poisson process with time parameter 
10 minutes: that is, the time after opening at which the first patient appears follows an 
exponential distribution with expectation 10 minutes and then, after each patient arrives, 
the waiting time until the next patient is independently exponentially distributed, also 
with expectation 10 minutes. When a patient arrives, he or she waits until a doctor 
is available. The amount of time spent by each doctor with each patient is a random 
variable, uniformly distributed between 5 and 20 minutes. The office stops admitting 
new patients at 4 p.m. and closes when the last patient is through with the doctor. 


(a) Simulate this process once. How many patients came to the office? How many had to 
wait for a doctor? What was their average wait? When did the office close? 


(b) Simulate the process 100 times and estimate the median and 50% interval for each of 
the summaries in (a). 




Chapter 2 


Single-parameter models 


Our first detailed discussion of Bayesian inference is in the context of statistical models 
where only a single scalar parameter is to be estimated; that is, the estimand θ is
one-dimensional. In this chapter, we consider four fundamental and widely used one-dimensional
models—the binomial, normal, Poisson, and exponential—and at the same time introduce 
important concepts and computational methods for Bayesian data analysis. 


2.1 Estimating a probability from binomial data 


In the simple binomial model, the aim is to estimate an unknown population proportion 
from the results of a sequence of ‘Bernoulli trials’; that is, data y_1, ..., y_n, each of which is
either 0 or 1. This problem provides a relatively simple but important starting point for 
the discussion of Bayesian inference. By starting with the binomial model, our discussion 
also parallels the very first published Bayesian analysis by Thomas Bayes in 1763, and his 
seminal contribution is still of interest. 

The binomial distribution provides a natural model for data that arise from a sequence 
of n exchangeable trials or draws from a large population where each trial gives rise to 
one of two possible outcomes, conventionally labeled ‘success’ and ‘failure.’ Because of the 
exchangeability, the data can be summarized by the total number of successes in the n 
trials, which we denote here by y. Converting from a formulation in terms of exchangeable 
trials to one using independent and identically distributed random variables is achieved 
naturally by letting the parameter θ represent the proportion of successes in the population
or, equivalently, the probability of success in each trial. The binomial sampling model is, 


$$p(y \mid \theta) = \mathrm{Bin}(y \mid n, \theta) = \binom{n}{y}\, \theta^y (1-\theta)^{n-y}, \tag{2.1}$$


where on the left side we suppress the dependence on n because it is regarded as part of the 
experimental design that is considered fixed; all the probabilities discussed for this problem 
are assumed to be conditional on n. 
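
For concreteness, the binomial sampling model (2.1) can be evaluated directly. This is a minimal sketch; the function name and the values n = 10 and θ = 0.3 are arbitrary choices for illustration.

```python
from math import comb

def binomial_pmf(y, n, theta):
    """Bin(y | n, theta) as in (2.1): probability of y successes in n trials."""
    return comb(n, y) * theta ** y * (1.0 - theta) ** (n - y)

n, theta = 10, 0.3      # illustrative values only
pmf = [binomial_pmf(y, n, theta) for y in range(n + 1)]
for y, prob in enumerate(pmf):
    print(y, round(prob, 4))
print(sum(pmf))         # probabilities over y = 0, ..., n sum to 1 up to rounding
```

Viewed as a function of θ for fixed y and n, the same expression is the likelihood that enters Bayes' rule.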


Example. Estimating the probability of a female birth 

As a specific application of the binomial model, we consider the estimation of the 
sex ratio within a population of human births. The proportion of births that are 
female has long been a topic of interest both scientifically and to the lay public. Two 
hundred years ago it was established that the proportion of female births in European 
populations was less than 0.5 (see Historical Note below), while in this century interest 
has focused on factors that may influence the sex ratio. The currently accepted value 
of the proportion of female births in large European-race populations is 0.485. 

For this example we define the parameter θ to be the proportion of female births, but
an alternative way of reporting this parameter is as a ratio of male to female birth
rates, φ = (1 − θ)/θ.

Let y be the number of girls in n recorded births. By applying the binomial model
(2.1), we are assuming that the n births are conditionally independent given θ, with
the probability of a female birth equal to θ for all cases. This modeling assumption
is motivated by the exchangeability that may be judged to arise when we have no
explanatory information (for example, distinguishing multiple births or births within
the same family) that might affect the sex of the baby.

Figure 2.1 Unnormalized posterior density for binomial parameter θ, based on uniform prior
distribution and y successes out of n trials. Curves displayed for several values of n and y.


To perform Bayesian inference in the binomial model, we must specify a prior
distribution for θ. We will discuss issues associated with specifying prior distributions many times
throughout this book, but for simplicity at this point, we assume that the prior distribution
for θ is uniform on the interval [0, 1].

Elementary application of Bayes’ rule as displayed in (1.2), applied to (2.1), then gives 
the posterior density for θ as

$$p(\theta|y) \propto \theta^{y}(1-\theta)^{n-y}. \tag{2.2}$$


With fixed n and y, the factor $\binom{n}{y}$ does not depend on the unknown parameter θ, and so it
can be treated as a constant when calculating the posterior distribution of θ. As is typical
of many examples, the posterior density can be written immediately in closed form, up to a
constant of proportionality. In single-parameter problems, this allows immediate graphical
presentation of the posterior distribution. For example, in Figure 2.1, the unnormalized
density (2.2) is displayed for several different experiments, that is, different values of n and
y. Each of the four experiments has the same proportion of successes, but the sample sizes
vary. In the present case, we can recognize (2.2) as the unnormalized form of the beta
distribution (see Appendix A),


$$\theta|y \sim \mathrm{Beta}(y+1,\ n-y+1). \tag{2.3}$$
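
As a rough numerical check of (2.2) and (2.3), the sketch below normalizes the unnormalized posterior on a grid and compares it with the Beta(y + 1, n − y + 1) density. The function names, the grid size, and the particular (n, y) pairs are illustrative assumptions rather than anything specified in the text; only the Python standard library is used.

```python
import math

def log_unnormalized_posterior(theta, y, n):
    """Log of theta^y (1 - theta)^(n - y), the right-hand side of (2.2)."""
    return y * math.log(theta) + (n - y) * math.log(1.0 - theta)

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, computed on the log scale for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1.0 - theta))

def normalized_grid_posterior(y, n, num_points=2000):
    """Normalize (2.2) numerically with the midpoint rule on (0, 1)."""
    width = 1.0 / num_points
    grid = [(i + 0.5) * width for i in range(num_points)]
    log_vals = [log_unnormalized_posterior(t, y, n) for t in grid]
    max_log = max(log_vals)                      # subtract the max to avoid underflow
    vals = [math.exp(v - max_log) for v in log_vals]
    total = sum(vals) * width                    # numerical normalizing constant
    return grid, [v / total for v in vals]

# Same proportion of successes (0.6) with increasing sample size
for n, y in [(5, 3), (20, 12), (100, 60)]:
    grid, dens = normalized_grid_posterior(y, n)
    max_err = max(abs(d - beta_pdf(t, y + 1, n - y + 1)) for t, d in zip(grid, dens))
    print(f"n={n:4d} y={y:3d} max abs difference from Beta density: {max_err:.2e}")
```

Working on the log scale and subtracting the maximum before exponentiating keeps the computation stable when n is large and the unnormalized density is extremely small.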


Historical note: Bayes and Laplace 

Many early writers on probability dealt with the elementary binomial model. The first 
contributions of lasting significance, in the 17th and early 18th centuries, concentrated 
on the ‘pre-data’ question: given θ, what are the probabilities of the various possible
outcomes of the random variable y? For example, the ‘weak law of large numbers’ of
Jacob Bernoulli states that if y ~ Bin(n, θ), then Pr(|y/n − θ| > ε | θ) → 0 as n → ∞,
for any θ and any fixed value of ε > 0. The Reverend Thomas Bayes, an English
part-time mathematician whose work was unpublished during his lifetime, and Pierre
Simon Laplace, an inventive and productive mathematical scientist whose massive
output spanned the Napoleonic era in France, receive independent credit as the first
to invert the probability statement and obtain probability statements about θ, given
observed y.

In his famous paper, published in 1763, Bayes sought, in our notation, the probability
Pr(θ ∈ (θ₁, θ₂)|y); his solution was based on a physical analogy of a probability space
to a rectangular table (such as a billiard table):

1. (Prior distribution) A ball W is randomly thrown (according to a uniform distribution
on the table). The horizontal position of the ball on the table is θ, expressed
as a fraction of the table width.

2. (Likelihood) A ball O is randomly thrown n times. The value of y is the number 
of times O lands to the right of W. 

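Thus, θ is assumed to have a (prior) uniform distribution on [0,1]. Using direct
probability calculations which he derived in the paper, Bayes then obtained

$$\begin{aligned}
\Pr(\theta\in(\theta_1,\theta_2)|y) &= \frac{\Pr(\theta\in(\theta_1,\theta_2),\,y)}{p(y)} \\
&= \frac{\int_{\theta_1}^{\theta_2} p(y|\theta)\,p(\theta)\,d\theta}{p(y)} \\
&= \frac{\int_{\theta_1}^{\theta_2}\binom{n}{y}\theta^{y}(1-\theta)^{n-y}\,d\theta}{p(y)}.
\end{aligned} \tag{2.4}$$

Bayes succeeded in evaluating the denominator, showing that

$$p(y) = \int_0^1 \binom{n}{y}\theta^{y}(1-\theta)^{n-y}\,d\theta = \frac{1}{n+1} \quad \text{for } y = 0,\ldots,n. \tag{2.5}$$
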
This calculation shows that all possible values of y are equally likely a priori.
The numerator of (2.4) is an incomplete beta integral with no closed-form expression
for large values of y and (n − y), a fact that apparently presented some difficulties for
Bayes.
Laplace, however, independently ‘discovered’ Bayes’ theorem, and developed new
analytic tools for computing integrals. For example, he expanded the function $\theta^{y}(1-\theta)^{n-y}$
around its maximum at θ = y/n and evaluated the incomplete beta integral using what
we now know as the normal approximation.
In analyzing the binomial model, Laplace also used the uniform prior distribution. His
first serious application was to estimate the proportion of girl births in a population.
A total of 241,945 girls and 251,527 boys were born in Paris from 1745 to 1770. Letting
θ be the probability that any birth is female, Laplace showed that

$$\Pr(\theta \ge 0.5 \,|\, y = 241{,}945,\ n = 251{,}527 + 241{,}945) \approx 1.15 \times 10^{-42},$$

and so he was ‘morally certain’ that θ < 0.5.
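
The order of magnitude of this tail probability can be checked with a simple normal approximation. The sketch below is not a reconstruction of Laplace's own calculation: it assumes the Beta(y + 1, n − y + 1) posterior from (2.3), plugs its exact mean and standard deviation into a normal tail formula, and uses hypothetical helper names. The result should come out on the order of 10⁻⁴², in line with the value quoted above.

```python
import math

def beta_mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)

def normal_upper_tail(z):
    """Pr(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

girls, boys = 241_945, 251_527
n = girls + boys

# Posterior under the uniform prior is Beta(y + 1, n - y + 1)
mean, sd = beta_mean_sd(girls + 1, n - girls + 1)
z = (0.5 - mean) / sd            # how many posterior sd's 0.5 lies above the mean
tail = normal_upper_tail(z)
print(f"posterior mean {mean:.4f}, sd {sd:.6f}, z {z:.2f}, tail probability {tail:.2e}")
```

Using `math.erfc` rather than `1 - cdf` matters here: the probability is far smaller than double-precision rounding error around 1.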


Prediction 


In the binomial example with the uniform prior distribution, the prior predictive distribution 
can be evaluated explicitly, as we have already noted in (2.5). Under the model, all possible 
values of y are equally likely, a priori. For posterior prediction from this model, we might
be more interested in the outcome of one new trial, rather than another set of n new trials.
Letting ỹ denote the result of a new trial, exchangeable with the first n,

$$\begin{aligned}
\Pr(\tilde{y}=1|y) &= \int_0^1 \Pr(\tilde{y}=1|\theta,y)\,p(\theta|y)\,d\theta \\
&= \int_0^1 \theta\,p(\theta|y)\,d\theta = \mathrm{E}(\theta|y) = \frac{y+1}{n+2},
\end{aligned} \tag{2.6}$$

from the properties of the beta distribution (see Appendix A). It is left as an exercise to
reproduce this result using direct integration of (2.6). This result, based on the uniform
prior distribution, is known as ‘Laplace’s law of succession.’ At the extreme observations
y = 0 and y = n, Laplace’s law predicts probabilities of 1/(n + 2) and (n + 1)/(n + 2), respectively.
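
A small numerical illustration of Laplace's law of succession follows. It is a sketch under stated assumptions: the function names are hypothetical, the integral in (2.6) is evaluated with a plain midpoint rule rather than analytically, and n = 10 is an arbitrary choice.

```python
from fractions import Fraction

def laplace_succession(y, n):
    """Pr(next trial is a success | y successes in n trials) under a uniform prior."""
    return Fraction(y + 1, n + 2)

def predictive_by_integration(y, n, num_points=20000):
    """Midpoint-rule evaluation of E(theta | y), the integral in (2.6)."""
    width = 1.0 / num_points
    num = den = 0.0
    for i in range(num_points):
        theta = (i + 0.5) * width
        weight = theta ** y * (1.0 - theta) ** (n - y)  # unnormalized posterior (2.2)
        num += theta * weight
        den += weight
    return num / den

n = 10
for y in (0, 3, 10):
    exact = laplace_succession(y, n)
    approx = predictive_by_integration(y, n)
    print(f"y={y:2d}: (y+1)/(n+2) = {exact} = {float(exact):.4f}, numerical = {approx:.4f}")
```

Even at the extremes y = 0 and y = n the predicted probability is strictly between 0 and 1, which is the point of the law.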


2.2 Posterior as compromise between data and prior information 


The process of Bayesian inference involves passing from a prior distribution, p(θ), to a
posterior distribution, p(θ|y), and it is natural to expect that some general relations might
hold between these two distributions. For example, we might expect that, because the
posterior distribution incorporates the information from the data, it will be less variable than
the prior distribution. This notion is formalized in the second of the following expressions:

$$\mathrm{E}(\theta) = \mathrm{E}(\mathrm{E}(\theta|y)) \tag{2.7}$$

and

$$\mathrm{var}(\theta) = \mathrm{E}(\mathrm{var}(\theta|y)) + \mathrm{var}(\mathrm{E}(\theta|y)), \tag{2.8}$$


which are obtained by substituting (θ, y) for the generic (u, v) in (1.8) and (1.9). The result
expressed by Equation (2.7) is scarcely surprising: the prior mean of θ is the average of all
possible posterior means over the distribution of possible data. The variance formula (2.8)
is more interesting because it says that the posterior variance is on average smaller than
the prior variance, by an amount that depends on the variation in posterior means over
the distribution of possible data. The greater the latter variation, the more the potential
for reducing our uncertainty with regard to θ, as we shall see in detail for the binomial
and normal models in the next chapter. The mean and variance relations only describe 
expectations, and in particular situations the posterior variance can be similar to or even 
larger than the prior variance (although this can be an indication of conflict or inconsistency 
between the sampling model and prior distribution). 

In the binomial example with the uniform prior distribution, the prior mean is 1/2, and
the prior variance is 1/12. The posterior mean, (y + 1)/(n + 2), is a compromise between the prior mean
and the sample proportion, y/n, where clearly the prior mean has a smaller and smaller role
as the size of the data sample increases. This is a general feature of Bayesian inference: the 
posterior distribution is centered at a point that represents a compromise between the prior 
information and the data, and the compromise is controlled to a greater extent by the data 
as the sample size increases. 
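
The identities (2.7) and (2.8) can be verified exactly for the binomial model with a uniform prior. The sketch below assumes the prior predictive distribution from (2.5) and the standard mean and variance of the Beta(y + 1, n − y + 1) posterior; the function name and the choice n = 10 are illustrative.

```python
from fractions import Fraction

def check_mean_variance_identities(n):
    """Evaluate both sides of (2.7) and (2.8) exactly for a uniform prior."""
    p_y = Fraction(1, n + 1)  # prior predictive probability of each y, from (2.5)
    post_means = [Fraction(y + 1, n + 2) for y in range(n + 1)]
    post_vars = [Fraction((y + 1) * (n - y + 1), (n + 2) ** 2 * (n + 3))
                 for y in range(n + 1)]

    mean_of_post_means = sum(p_y * m for m in post_means)
    mean_of_post_vars = sum(p_y * v for v in post_vars)
    var_of_post_means = sum(p_y * (m - mean_of_post_means) ** 2 for m in post_means)
    return mean_of_post_means, mean_of_post_vars, var_of_post_means

e_mean, e_var, var_mean = check_mean_variance_identities(n=10)
print("E(E(theta|y))           =", e_mean)             # should equal the prior mean, 1/2
print("E(var(theta|y))         =", e_var)
print("var(E(theta|y))         =", var_mean)
print("E(var) + var(E)         =", e_var + var_mean)   # should equal the prior variance, 1/12
```

Exact rational arithmetic is used so that the two sides of each identity can be compared without rounding error.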


2.3 Summarizing posterior inference 


The posterior probability distribution contains all the current information about the
parameter θ. Ideally one might report the entire posterior distribution p(θ|y); as we have seen
in Figure 2.1, a graphical display is useful. In Chapter 3, we use contour plots and
scatterplots to display posterior distributions in multiparameter problems. A key advantage of
the Bayesian approach, as implemented by simulation, is the flexibility with which posterior
inferences can be summarized, even after complicated transformations. This advantage is
most directly seen through examples, some of which will be presented shortly.

Figure 2.2 Hypothetical density for which the 95% central interval and 95% highest posterior density
region dramatically differ: (a) central posterior interval, (b) highest posterior density region.

For many practical purposes, however, various numerical summaries of the distribu- 
tion are desirable. Commonly used summaries of location are the mean, median, and 
mode(s) of the distribution; variation is commonly summarized by the standard deviation, 
the interquartile range, and other quantiles. Each summary has its own interpretation: for 
example, the mean is the posterior expectation of the parameter, and the mode may be 
interpreted as the single ‘most likely’ value, given the data (and the model). Furthermore, 
as we shall see, much practical inference relies on the use of normal approximations, often 
improved by applying a symmetrizing transformation to θ, and here the mean and the
standard deviation play key roles. The mode is important in computational strategies for more
complex problems because it is often easier to compute than the mean or median. 

When the posterior distribution has a closed form, such as the beta distribution in 
the current example, summaries such as the mean, median, and standard deviation of 
the posterior distribution are often available in closed form. For example, applying the 
distributional results in Appendix A, the mean of the beta distribution in (2.3) is (y + 1)/(n + 2) and
the mode is y/n, which is well known from different points of view as the maximum likelihood
and (minimum variance) unbiased estimate of θ.


Posterior quantiles and intervals 


In addition to point summaries, it is nearly always important to report posterior uncertainty. 
Our usual approach is to present quantiles of the posterior distribution of estimands of 
interest or, if an interval summary is desired, a central interval of posterior probability, 
which corresponds, in the case of a 100(1 − α)% interval, to the range of values above and
below which lies exactly 100(α/2)% of the posterior probability. Such interval estimates
are referred to as posterior intervals. For simple models, such as the binomial and normal, 
posterior intervals can be computed directly from cumulative distribution functions, often 
using calls to standard computer functions, as we illustrate in Section 2.4 with the example 
of the human sex ratio. In general, intervals can be computed using computer simulations 
from the posterior distribution, as described at the end of Section 1.9. 

A slightly different summary of posterior uncertainty is the highest posterior density 
region: the set of values that contains 100(1 − α)% of the posterior probability and also
has the characteristic that the density within the region is never lower than that outside.
Such a region is identical to a central posterior interval if the posterior distribution is
unimodal and symmetric. In current practice, the central posterior interval is in common
use, partly because it has a direct interpretation as the posterior α/2 and 1 − α/2 quantiles,
and partly because it is directly computed using posterior simulations. Figure 2.2 shows
a case where different posterior summaries look much different: the 95% central interval
includes the area of zero probability in the center of the distribution, whereas the 95%
highest posterior density region comprises two disjoint intervals. In this situation, the
highest posterior density region is more cumbersome but conveys more information than
the central interval; however, it is probably better not to try to summarize this bimodal 
density by any single interval. The central interval and the highest posterior density region 
can also differ substantially when the posterior density is highly skewed. 
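
The difference between the two summaries for a skewed posterior can be seen from simulation draws. In the sketch below, the Beta(2, 20) distribution, the seed, and the function names are assumptions made for illustration. The shortest interval containing 95% of the draws is used as a stand-in for the highest posterior density region, which is reasonable only for unimodal densities and would not recover the two disjoint intervals of Figure 2.2.

```python
import random

def central_interval(draws, prob=0.95):
    """Central posterior interval: empirical alpha/2 and 1 - alpha/2 quantiles."""
    ordered = sorted(draws)
    n = len(ordered)
    tail = (1.0 - prob) / 2.0
    lower = ordered[int(round(tail * n))]
    upper = ordered[int(round((1.0 - tail) * n)) - 1]
    return lower, upper

def shortest_interval(draws, prob=0.95):
    """Shortest interval containing a fraction `prob` of the draws.

    For a unimodal posterior this approximates the highest posterior
    density region; it is not appropriate for multimodal densities.
    """
    ordered = sorted(draws)
    n = len(ordered)
    k = int(round(prob * n))  # number of draws the interval must contain
    best_start = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[best_start], ordered[best_start + k - 1]

rng = random.Random(1)  # arbitrary seed, for reproducibility only
draws = [rng.betavariate(2, 20) for _ in range(10000)]  # a skewed, unimodal example
c_lo, c_hi = central_interval(draws)
s_lo, s_hi = shortest_interval(draws)
print(f"central  95% interval: [{c_lo:.3f}, {c_hi:.3f}], width {c_hi - c_lo:.3f}")
print(f"shortest 95% interval: [{s_lo:.3f}, {s_hi:.3f}], width {s_hi - s_lo:.3f}")
```

Both intervals contain the same number of draws, so the shortest interval can never be wider than the central one; for a right-skewed density it is shifted toward the mode.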


2.4 Informative prior distributions 


In the binomial example, we have so far considered only the uniform prior distribution for
θ. How can this specification be justified, and how in general do we approach the problem
of constructing prior distributions? 

We consider two basic interpretations that can be given to prior distributions. In the
population interpretation, the prior distribution represents a population of possible
parameter values, from which the θ of current interest has been drawn. In the more subjective state
of knowledge interpretation, the guiding principle is that we must express our knowledge
(and uncertainty) about θ as if its value could be thought of as a random realization from
the prior distribution. For many problems, such as estimating the probability of failure in
a new industrial process, there is no perfectly relevant population of θ’s from which the
current θ has been drawn, except in hypothetical contemplation. Typically, the prior
distribution should include all plausible values of θ, but the distribution need not be realistically
concentrated around the true value, because often the information about θ contained in the
data will far outweigh any reasonable prior probability specification.

In the binomial example, we have seen that the uniform prior distribution for θ
implies that the prior predictive distribution for y (given n) is uniform on the discrete set
{0,1,...,n}, giving equal probability to the n + 1 possible values. In his original treatment
of this problem (described in the Historical Note in Section 2.1), Bayes’ justification for the
uniform prior distribution appears to have been based on this observation; the argument
is appealing because it is expressed entirely in terms of the observable quantities y and n.
Laplace’s rationale for the uniform prior density was less clear, but subsequent
interpretations ascribe to him the so-called ‘principle of insufficient reason,’ which claims that a
uniform specification is appropriate if nothing is known about θ. We shall discuss in Section
2.8 the weaknesses of the principle of insufficient reason as a general approach for assigning 
probability distributions. 

At this point, we discuss some of the issues that arise in assigning a prior distribution 
that reflects substantive information. 


Binomial example with different prior distributions 


We first pursue the binomial model in further detail using a parametric family of prior 
distributions that includes the uniform as a special case. For mathematical convenience, we 
construct a family of prior densities that lead to simple posterior densities. 

Considered as a function of θ, the likelihood (2.1) is of the form

$$p(y|\theta) \propto \theta^{a}(1-\theta)^{b},$$

where a = y and b = n − y. Thus, if the prior density is of the same form, with its own values a and b, then the posterior
density will also be of this form. We will parameterize such a prior density as

$$p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1},$$

which is a beta distribution with parameters α and β: θ ~ Beta(α, β). Comparing p(θ) and
p(y|θ) suggests that this prior density is equivalent to α − 1 prior successes and β − 1 prior
failures. The parameters of the prior distribution are often referred to as hyperparameters.
The beta prior distribution is indexed by two hyperparameters, which means we can specify
a particular prior distribution by fixing two features of the distribution, for example its mean
and variance; see (A.3) on page 585.

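For now, assume that we can select reasonable values α and β. Appropriate methods
for working with unknown hyperparameters in certain problems are described in Chapter
5. The posterior density for θ is

$$\begin{aligned}
p(\theta|y) &\propto \theta^{y}(1-\theta)^{n-y}\,\theta^{\alpha-1}(1-\theta)^{\beta-1} \\
&= \theta^{y+\alpha-1}(1-\theta)^{n-y+\beta-1} \\
&= \mathrm{Beta}(\theta|\alpha+y,\ \beta+n-y).
\end{aligned}$$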


The property that the posterior distribution follows the same parametric form as the 
prior distribution is called conjugacy; the beta prior distribution is a conjugate family for 
the binomial likelihood. The conjugate family is mathematically convenient in that the 
posterior distribution follows a known parametric form. If information is available that 
contradicts the conjugate parametric family, it may be necessary to use a more realistic, if 
inconvenient, prior distribution (just as the binomial likelihood may need to be replaced by 
a more realistic likelihood in some cases). 

To continue with the binomial model with beta prior distribution, the posterior mean of
θ, which may be interpreted as the posterior probability of success for a future draw from
the population, is now

$$\mathrm{E}(\theta|y) = \frac{\alpha+y}{\alpha+\beta+n},$$

which always lies between the sample proportion, y/n, and the prior mean, α/(α + β); see
Exercise 2.5b. The posterior variance is

$$\mathrm{var}(\theta|y) = \frac{(\alpha+y)(\beta+n-y)}{(\alpha+\beta+n)^2(\alpha+\beta+n+1)} = \frac{\mathrm{E}(\theta|y)\,[1-\mathrm{E}(\theta|y)]}{\alpha+\beta+n+1}.$$

As y and n − y become large with fixed α and β, E(θ|y) ≈ y/n and var(θ|y) ≈ (1/n)(y/n)(1 − y/n),
which approaches zero at the rate 1/n. In the limit, the parameters of the prior distribution
have no influence on the posterior distribution.
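
The conjugate update and its limiting behavior can be sketched in a few lines. The Beta(4, 4) prior and the data values below are hypothetical, chosen only so that the prior mean (0.5) and the sample proportion (0.3) differ; the function names are likewise illustrative.

```python
import math

def beta_binomial_update(alpha, beta, y, n):
    """Posterior Beta parameters after y successes in n trials with a Beta(alpha, beta) prior."""
    return alpha + y, beta + n - y

def beta_mean_var(alpha, beta):
    """Mean and variance of Beta(alpha, beta), in the form E(1 - E) / (alpha + beta + 1)."""
    mean = alpha / (alpha + beta)
    var = mean * (1.0 - mean) / (alpha + beta + 1.0)
    return mean, var

alpha, beta = 4.0, 4.0                              # hypothetical prior with mean 0.5
for y, n in [(3, 10), (30, 100), (300, 1000)]:      # sample proportion fixed at 0.3
    a_post, b_post = beta_binomial_update(alpha, beta, y, n)
    mean, var = beta_mean_var(a_post, b_post)
    print(f"n={n:5d}: posterior Beta({a_post:.0f}, {b_post:.0f}), "
          f"mean {mean:.3f}, sd {math.sqrt(var):.3f}")
```

As n grows with the sample proportion held fixed, the posterior mean moves from between 0.3 and 0.5 toward 0.3, and the posterior standard deviation shrinks roughly like 1/√n.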

In fact, as we shall see in more detail in Chapter 4, the central limit theorem of
probability theory can be put in a Bayesian context to show:

$$\left(\left.\frac{\theta - \mathrm{E}(\theta|y)}{\sqrt{\mathrm{var}(\theta|y)}}\ \right|\ y\right) \to \mathrm{N}(0,1).$$

This result is often used to justify approximating the posterior distribution with a normal
distribution. For the binomial parameter θ, the normal distribution is a more accurate
approximation in practice if we transform θ to the logit scale; that is, performing inference
for log(θ/(1 − θ)) instead of θ itself, thus expanding the probability space from [0,1] to
(−∞, ∞), which is more fitting for a normal approximation.


Conjugate prior distributions 


Conjugacy is formally defined as follows. If F is a class of sampling distributions p(y|θ),
and P is a class of prior distributions for θ, then the class P is conjugate for F if

$$p(\theta|y) \in P \ \text{ for all } \ p(\cdot|\theta) \in F \ \text{ and } \ p(\cdot) \in P.$$


This definition is formally vague since if we choose P as the class of all distributions, then 
P is always conjugate no matter what class of sampling distributions is used. We are most
interested in natural conjugate prior families, which arise by taking P to be the set of all
densities having the same functional form as the likelihood.

Conjugate prior distributions have the practical advantage, in addition to computational 
convenience, of being interpretable as additional data, as we have seen for the binomial 
example and will also see for the normal and other standard models in Sections 2.5 and 2.6. 


Nonconjugate prior distributions 


The basic justification for the use of conjugate prior distributions is similar to that for using 
standard models (such as binomial and normal) for the likelihood: it is easy to understand 
the results, which can often be put in analytic form, they are often a good approximation, 
and they simplify computations. Also, they will be useful later as building blocks for more 
complicated models, including in many dimensions, where conjugacy is typically impossible. 
For these reasons, conjugate models can be good starting points; for example, mixtures of 
conjugate families can sometimes be useful when simple conjugate distributions are not 
reasonable (see Exercise 2.4). 

Although they can make interpretations of posterior inferences less transparent and 
computation more difficult, nonconjugate prior distributions do not pose any new conceptual 
problems. In practice, for complicated models, conjugate prior distributions may not even 
be possible. Section 2.4 and Exercises 2.10 and 2.11 present examples of nonconjugate 
computation; a more extensive nonconjugate example, an analysis of a bioassay experiment, 
appears in Section 3.7. 


Conjugate prior distributions, exponential families, and sufficient statistics 


We close this section by relating conjugate families of distributions to the classical concepts 
of exponential families and sufficient statistics. Readers who are unfamiliar with these 
concepts can skip ahead to the example with no loss. 

Probability distributions that belong to an exponential family have natural conjugate 
prior distributions, so we digress at this point to review the definition of exponential families; 
for complete generality in this section, we allow data points y_i and parameters θ to be
multidimensional. The class F is an exponential family if all its members have the form

$$p(y_i|\theta) = f(y_i)\,g(\theta)\,e^{\phi(\theta)^T u(y_i)}.$$

The factors φ(θ) and u(y_i) are, in general, vectors of equal dimension to that of θ. The
vector φ(θ) is called the ‘natural parameter’ of the family F. The likelihood corresponding
to a sequence y = (y_1, ..., y_n) of independent and identically distributed observations is

$$p(y|\theta) = \left(\prod_{i=1}^{n} f(y_i)\right) g(\theta)^{n} \exp\left(\phi(\theta)^T \sum_{i=1}^{n} u(y_i)\right).$$

For all n and y, this has a fixed form (as a function of θ):

$$p(y|\theta) \propto g(\theta)^{n}\, e^{\phi(\theta)^T t(y)}, \quad \text{where } t(y) = \sum_{i=1}^{n} u(y_i).$$


The quantity t(y) is said to be a sufficient statistic for θ, because the likelihood for θ
depends on the data y only through the value of t(y). Sufficient statistics are useful in
algebraic manipulations of likelihoods and posterior distributions. If the prior density is
specified as

$$p(\theta) \propto g(\theta)^{\eta}\, e^{\phi(\theta)^T \nu},$$

then the posterior density is

$$p(\theta|y) \propto g(\theta)^{\eta+n}\, e^{\phi(\theta)^T (\nu + t(y))},$$


which shows that this choice of prior density is conjugate. It has been shown that, in general, 
the exponential families are the only classes of distributions that have natural conjugate 
prior distributions, since, apart from certain irregular cases, the only distributions having 
a fixed number of sufficient statistics for all n are of the exponential type. We have already 
discussed the binomial distribution, where for the likelihood p(y|θ, n) = Bin(y|n, θ) with n
known, the conjugate prior distributions on θ are beta distributions. It is left as an exercise
to show that the binomial is an exponential family with natural parameter logit(θ).
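
As a sketch of how that exercise might be approached (treating the single binomial count y, with n fixed, as the data point), the binomial density can be rearranged into the exponential-family form given above:

$$\begin{aligned}
\mathrm{Bin}(y|n,\theta) &= \binom{n}{y}\theta^{y}(1-\theta)^{n-y} \\
&= \binom{n}{y}(1-\theta)^{n}\left(\frac{\theta}{1-\theta}\right)^{y} \\
&= \underbrace{\binom{n}{y}}_{f(y)}\ \underbrace{(1-\theta)^{n}}_{g(\theta)}\ \exp\Big(\underbrace{\log\frac{\theta}{1-\theta}}_{\phi(\theta)}\cdot\underbrace{y}_{u(y)}\Big).
\end{aligned}$$

Under this reading, φ(θ) = logit(θ) and u(y) = y. The corresponding natural conjugate prior, $g(\theta)^{\eta} e^{\phi(\theta)\nu} = \theta^{\nu}(1-\theta)^{n\eta-\nu}$, is of beta form, consistent with the conjugacy result stated in the text.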


Example. Probability of a girl birth given placenta previa 

As a specific example of a factor that may influence the sex ratio, we consider the 
maternal condition placenta previa, an unusual condition of pregnancy in which the 
placenta is implanted low in the uterus, obstructing the fetus from a normal vaginal 
delivery. An early study concerning the sex of placenta previa births in Germany found 
that of a total of 980 births, 437 were female. How much evidence does this provide 
for the claim that the proportion of female births in the population of placenta previa 
births is less than 0.485, the proportion of female births in the general population? 


Analysis using a uniform prior distribution. Under a uniform prior distribution for 
the probability of a girl birth, the posterior distribution is Beta(438, 544). Exact
summaries of the posterior distribution can be obtained from the properties of the
beta distribution (Appendix A): the posterior mean of θ is 0.446 and the posterior
standard deviation is 0.016. Exact posterior quantiles can be obtained using numerical 
integration of the beta density, which in practice we perform by a computer function 
call; the median is 0.446 and the central 95% posterior interval is [0.415, 0.477]. This 
95% posterior interval matches, to three decimal places, the interval that would be 
obtained by using a normal approximation with the calculated posterior mean and 
standard deviation. Further discussion of the approximate normality of the posterior 
distribution is given in Chapter 4. 

In many situations it is not feasible to perform calculations on the posterior density 
function directly. In such cases it can be particularly useful to use simulation from the 
posterior distribution to obtain inferences. The first histogram in Figure 2.3 shows the 
distribution of 1000 draws from the Beta(438, 544) posterior distribution. An estimate 
of the 95% posterior interval, obtained by taking the 25th and 976th of the 1000 
ordered draws, is [0.415, 0.476], and the median of the 1000 draws from the posterior 
distribution is 0.446. The sample mean and standard deviation of the 1000 draws are 
0.445 and 0.016, almost identical to the exact results. A normal approximation to the 
95% posterior interval is [0.445 ± 1.96 · 0.016] = [0.414, 0.476]. Because of the large
sample and the fact that the distribution of θ is concentrated away from zero and one,
the normal approximation works well in this example. 
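
The simulation summaries described above can be sketched with the standard library alone. The seed and function name are arbitrary assumptions, and because the draws are random, the printed values will differ slightly from the figures quoted in the text while remaining close to the exact results.

```python
import random
import statistics

def summarize_draws(draws):
    """Median, 95% interval, mean, and sd of a list of posterior draws.

    For 1000 draws the interval endpoints are the 25th and 976th ordered
    draws, matching the convention described in the text.
    """
    ordered = sorted(draws)
    n = len(ordered)
    lower = ordered[(25 * n) // 1000 - 1]   # 25th of 1000 ordered draws
    upper = ordered[(975 * n) // 1000]      # 976th of 1000 ordered draws
    return {
        "median": statistics.median(ordered),
        "interval": (lower, upper),
        "mean": statistics.fmean(ordered),
        "sd": statistics.stdev(ordered),
    }

rng = random.Random(2024)  # arbitrary seed, for reproducibility only
draws = [rng.betavariate(438, 544) for _ in range(1000)]
summary = summarize_draws(draws)
print(summary)

# Normal approximation using the simulated mean and standard deviation
lo = summary["mean"] - 1.96 * summary["sd"]
hi = summary["mean"] + 1.96 * summary["sd"]
print(f"normal approximation: [{lo:.3f}, {hi:.3f}]")
```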

As already noted, when estimating a proportion, the normal approximation is
generally improved by applying it to the logit transform, log(θ/(1 − θ)), which transforms the
parameter space from the unit interval to the real line. The second histogram in Figure
2.3 shows the distribution of the transformed draws. The estimated posterior mean
and standard deviation on the logit scale based on 1000 draws are −0.220 and 0.065.
A normal approximation to the 95% posterior interval for θ is obtained by inverting
the 95% interval on the logit scale [−0.220 ± 1.96 · 0.065], which yields [0.414, 0.477]
on the original scale. The improvement from using the logit scale is most noticeable
when the sample size is small or the distribution of θ includes values near zero or one.
In any real data analysis, it is important to keep the applied context in mind. The
parameter of interest in this example is traditionally expressed as the ‘sex ratio,’ (1 − θ)/θ,
the ratio of male to female births. The posterior distribution of the ratio is illustrated
in the third histogram. The posterior median of the sex ratio is 1.24, and the 95%
posterior interval is [1.10, 1.41]. The posterior distribution is concentrated on values
far above the usual European-race sex ratio of 1.06, implying that the probability of
a female birth given placenta previa is less than in the general population.

Figure 2.3 Draws from the posterior distribution of (a) the probability of female birth, θ; (b) the
logit transform, logit(θ); (c) the male-to-female sex ratio, φ = (1 − θ)/θ.

Parameters of the prior distribution     Summaries of the posterior distribution
α/(α+β)     α+β                          Posterior median of θ     95% posterior interval for θ
0.500         2                          0.446                     [0.415, 0.477]
0.485         2                          0.446                     [0.415, 0.477]
0.485         5                          0.446                     [0.415, 0.477]
0.485        10                          0.446                     [0.415, 0.477]
0.485        20                          0.447                     [0.416, 0.478]
0.485       100                          0.450                     [0.420, 0.479]
0.485       200                          0.453                     [0.424, 0.481]

Table 2.1 Summaries of the posterior distribution of θ, the probability of a girl birth given placenta
previa, under a variety of conjugate prior distributions.
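
The transformed summaries in this example (the logit-scale normal approximation and the sex ratio) can be obtained directly from posterior draws of θ, which is the flexibility of simulation referred to in Section 2.3. The following sketch assumes the Beta(438, 544) posterior; the seed and helper names are arbitrary, and the printed numbers will vary slightly from those quoted in the text.

```python
import math
import random
import statistics

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def interval_95(draws):
    """Central 95% interval from the ordered draws (25th and 976th of 1000)."""
    ordered = sorted(draws)
    n = len(ordered)
    return ordered[(25 * n) // 1000 - 1], ordered[(975 * n) // 1000]

rng = random.Random(7)  # arbitrary seed
theta = [rng.betavariate(438, 544) for _ in range(1000)]

# Normal approximation on the logit scale, mapped back to the probability scale
z = [logit(t) for t in theta]
z_mean, z_sd = statistics.fmean(z), statistics.stdev(z)
logit_interval = (inv_logit(z_mean - 1.96 * z_sd), inv_logit(z_mean + 1.96 * z_sd))

# Sex ratio (1 - theta) / theta, summarized directly from the transformed draws
ratio = [(1.0 - t) / t for t in theta]
ratio_median = statistics.median(ratio)
ratio_interval = interval_95(ratio)

print(f"logit scale: mean {z_mean:.3f}, sd {z_sd:.3f}")
print(f"interval for theta via logit: [{logit_interval[0]:.3f}, {logit_interval[1]:.3f}]")
print(f"sex ratio: median {ratio_median:.2f}, "
      f"95% interval [{ratio_interval[0]:.2f}, {ratio_interval[1]:.2f}]")
```

No new derivation is needed for the sex ratio: each draw of θ is simply transformed, and the summaries are computed from the transformed draws.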


Analysis using different conjugate prior distributions. The sensitivity of posterior 
inference about θ to the proposed prior distribution is exhibited in Table 2.1. The
first row corresponds to the uniform prior distribution, α = 1, β = 1, and subsequent
rows of the table use prior distributions that are increasingly concentrated around
0.485, the proportion of female births in the general population. The first column
shows the prior mean for θ, and the second column indexes the amount of prior
information, as measured by α + β; recall that α + β − 2 is, in some sense, equivalent
to the number of prior observations. Posterior inferences based on a large sample are 
not particularly sensitive to the prior distribution. Only at the bottom of the table, 
where the prior distribution contains information equivalent to 100 or 200 births, are 
the posterior intervals pulled noticeably toward the prior distribution, and even then, 
the 95% posterior intervals still exclude the prior mean. 
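
A sensitivity analysis of this kind can be sketched as follows, for comparison with Table 2.1. The text obtains quantiles from a library function; here, to stay within the Python standard library, the beta quantiles are approximated from a cumulative sum on a fine grid, which is an assumption of this sketch rather than the method used in the text. The function name and grid size are illustrative.

```python
import math

def beta_quantiles_on_grid(a, b, probs, num_points=20000):
    """Approximate quantiles of Beta(a, b) from a midpoint-rule CDF on (0, 1)."""
    width = 1.0 / num_points
    grid = [(i + 0.5) * width for i in range(num_points)]
    log_dens = [(a - 1) * math.log(t) + (b - 1) * math.log(1.0 - t) for t in grid]
    max_log = max(log_dens)                       # guard against underflow
    weights = [math.exp(v - max_log) for v in log_dens]
    total = sum(weights)

    quantiles, cumulative, targets = [], 0.0, sorted(probs)
    idx = 0
    for t, w in zip(grid, weights):
        cumulative += w / total
        while idx < len(targets) and cumulative >= targets[idx]:
            quantiles.append(t)
            idx += 1
    return quantiles

y, n = 437, 980
priors = [(0.500, 2), (0.485, 2), (0.485, 5), (0.485, 10),
          (0.485, 20), (0.485, 100), (0.485, 200)]   # (prior mean, alpha + beta)
for prior_mean, strength in priors:
    alpha = prior_mean * strength
    beta = strength - alpha
    lo, med, hi = beta_quantiles_on_grid(alpha + y, beta + n - y, [0.025, 0.5, 0.975])
    print(f"{prior_mean:.3f} {strength:4d}   median {med:.3f}   "
          f"95% interval [{lo:.3f}, {hi:.3f}]")
```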


Analysis using a nonconjugate prior distribution. As an alternative to the conjugate
beta family for this problem, we might prefer a prior distribution that is centered
around 0.485 but is flat far away from this value to admit the possibility that the
truth is far away. The piecewise linear prior density in Figure 2.4a is an example
of a prior distribution of this form; 40% of the probability mass is outside the
interval [0.385, 0.585]. This prior distribution has mean 0.493 and standard deviation 0.21,
similar to the standard deviation of a beta distribution with α + β = 5. The unnormalized
posterior distribution is obtained at a grid of θ values, (0.000, 0.001, ..., 1.000),
by multiplying the prior density and the binomial likelihood at each point. Posterior
simulations can be obtained by normalizing the distribution on the discrete grid
of θ values and sampling from it. Figure 2.4b is a histogram of 1000 draws from the discrete posterior
distribution. The posterior median is 0.448, and the 95% central posterior interval is
[0.419, 0.480]. Because the prior distribution is overwhelmed by the data, these results
match those in Table 2.1 based on beta distributions. In taking the grid approach, it
is important to avoid grids that are too coarse and distort a significant portion of the
posterior mass.

Figure 2.4 (a) Prior density for θ in an example nonconjugate analysis of birth ratio example; (b)
histogram of 1000 draws from a discrete approximation to the posterior density. Figures are plotted
on different scales.
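
The grid approach for the nonconjugate prior can be sketched as below. The exact shape of the prior in Figure 2.4a is not given in the text, so the density used here is an assumption: flat at height 1 away from 0.485 and rising linearly to a peak of 11 at 0.485 over the interval [0.385, 0.585], which places 40% of the mass outside that interval and gives a prior mean of about 0.493. Function names and the seed are also illustrative.

```python
import math
import random

def piecewise_linear_prior(theta, center=0.485, half_width=0.1, peak=11.0):
    """Unnormalized prior: flat at 1 far from `center`, rising linearly to `peak` at `center`."""
    distance = abs(theta - center)
    if distance >= half_width:
        return 1.0
    return 1.0 + (peak - 1.0) * (1.0 - distance / half_width)

def grid_posterior(y, n, num_steps=1000):
    """Normalized posterior probabilities on the grid 0.000, 0.001, ..., 1.000."""
    grid = [i / num_steps for i in range(num_steps + 1)]
    log_post = []
    for theta in grid:
        if theta in (0.0, 1.0):
            log_post.append(-math.inf)  # likelihood is zero at the endpoints when 0 < y < n
        else:
            log_lik = y * math.log(theta) + (n - y) * math.log(1.0 - theta)
            log_post.append(math.log(piecewise_linear_prior(theta)) + log_lik)
    max_log = max(log_post)
    weights = [math.exp(v - max_log) for v in log_post]
    total = sum(weights)
    return grid, [w / total for w in weights]

def grid_quantile(grid, probs, q):
    """Smallest grid value whose cumulative posterior probability reaches q."""
    cumulative = 0.0
    for theta, p in zip(grid, probs):
        cumulative += p
        if cumulative >= q:
            return theta
    return grid[-1]

grid, probs = grid_posterior(y=437, n=980)
rng = random.Random(0)  # arbitrary seed
draws = rng.choices(grid, weights=probs, k=1000)  # simulate from the discrete posterior
print("median on grid:", grid_quantile(grid, probs, 0.5))
print("95% interval on grid:",
      (grid_quantile(grid, probs, 0.025), grid_quantile(grid, probs, 0.975)))
print("mean of 1000 draws:", sum(draws) / len(draws))
```

With a grid spacing of 0.001 and a posterior standard deviation of roughly 0.016, several dozen grid points fall within the bulk of the posterior, so the discretization is fine enough here.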


2.5 Normal distribution with known variance 


The normal distribution is fundamental to most statistical modeling. The central limit 
theorem helps to justify using the normal likelihood in many statistical problems, as an 
approximation to a less analytically convenient actual likelihood. Also, as we shall see in 
later chapters, even when the normal distribution does not itself provide a good model fit, 
it can be useful as a component of a more complicated model involving t or finite mixture 
distributions. For now, we simply work through the Bayesian results assuming the normal 
model is appropriate. We derive results first for a single data point and then for the general 
case of many data points. 


Likelihood of one data point 


As the simplest first case, consider a single scalar observation y from a normal distribution
parameterized by a mean θ and variance σ², where for this initial development we assume
that σ² is known. The sampling distribution is

$$p(y|\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(y-\theta)^2}.$$


Conjugate prior and posterior distributions 


Considered as a function of θ, the likelihood is an exponential of a quadratic form in θ, so
the family of conjugate prior densities looks like

$$p(\theta) = e^{A\theta^2 + B\theta + C}.$$

We parameterize this family as

$$p(\theta) \propto \exp\left(-\frac{1}{2\tau_0^2}(\theta-\mu_0)^2\right);$$

that is, θ ~ N(μ₀, τ₀²), with hyperparameters μ₀ and τ₀². As usual in this preliminary
development, we assume that the hyperparameters are known.

The conjugate prior density implies that the posterior distribution for $\theta$ is the exponential
of a quadratic form and thus normal, but some algebra is required to reveal its specific
form. In the posterior density, all variables except $\theta$ are regarded as constants, giving the
conditional density,


$$p(\theta|y) \propto \exp\left(-\frac{1}{2}\left(\frac{(y-\theta)^2}{\sigma^2} + \frac{(\theta-\mu_0)^2}{\tau_0^2}\right)\right).$$


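Expanding the exponents, collecting terms and then completing the square in $\theta$ (see Exercise
2.14(a) for details) gives


$$p(\theta|y) \propto \exp\left(-\frac{1}{2\tau_1^2}(\theta - \mu_1)^2\right), \qquad (2.9)$$

that is, $\theta|y \sim \mathrm{N}(\mu_1, \tau_1^2)$, where

$$\mu_1 = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{1}{\sigma^2}y}{\frac{1}{\tau_0^2} + \frac{1}{\sigma^2}} \quad\text{and}\quad \frac{1}{\tau_1^2} = \frac{1}{\tau_0^2} + \frac{1}{\sigma^2}. \qquad (2.10)$$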
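
For readers who want the omitted step, the completing-the-square argument can be outlined as follows. This is only a sketch in the notation above, with "const" standing for terms that do not depend on $\theta$; the full details remain the subject of Exercise 2.14(a).

```latex
\begin{align*}
\frac{(y-\theta)^2}{\sigma^2} + \frac{(\theta-\mu_0)^2}{\tau_0^2}
 &= \left(\frac{1}{\sigma^2} + \frac{1}{\tau_0^2}\right)\theta^2
    - 2\left(\frac{y}{\sigma^2} + \frac{\mu_0}{\tau_0^2}\right)\theta + \text{const} \\
 &= \frac{1}{\tau_1^2}\left(\theta^2 - 2\mu_1\theta\right) + \text{const} \\
 &= \frac{1}{\tau_1^2}(\theta - \mu_1)^2 + \text{const}.
\end{align*}
```

The second line uses the definitions in (2.10): the coefficient of $\theta^2$ is the posterior precision $1/\tau_1^2$, and the coefficient of $-2\theta$ equals $\mu_1/\tau_1^2$.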


Precisions of the prior and posterior distributions. In manipulating normal distributions, 
the inverse of the variance plays a prominent role and is called the precision. The algebra 
above demonstrates that for normal data and normal prior distribution (each with known 
precision), the posterior precision equals the prior precision plus the data precision. 

There are several different ways of interpreting the form of the posterior mean, $\mu_1$. In
(2.10), the posterior mean is expressed as a weighted average of the prior mean and the
observed value, $y$, with weights proportional to the precisions. Alternatively, we can express
$\mu_1$ as the prior mean adjusted toward the observed $y$,


$$\mu_1 = \mu_0 + (y - \mu_0)\frac{\tau_0^2}{\sigma^2 + \tau_0^2},$$

or as the data ‘shrunk’ toward the prior mean,

$$\mu_1 = y - (y - \mu_0)\frac{\sigma^2}{\sigma^2 + \tau_0^2}.$$

Each formulation represents the posterior mean as a compromise between the prior mean 
and the observed value. 
At the extremes, the posterior mean equals the prior mean or the observed data:


$$\mu_1 = \mu_0 \quad\text{if}\quad y = \mu_0 \ \text{ or } \ \tau_0^2 = 0;$$
$$\mu_1 = y \quad\text{if}\quad y = \mu_0 \ \text{ or } \ \sigma^2 = 0.$$


If $\tau_0^2 = 0$, the prior distribution is infinitely more precise than the data, and so the posterior
and prior distributions are identical and concentrated at the value $\mu_0$. If $\sigma^2 = 0$, the data
are perfectly precise, and the posterior distribution is concentrated at the observed value,
$y$. If $y = \mu_0$, the prior and data means coincide, and the posterior mean must also fall at
this point.
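
The three expressions for the posterior mean can be checked numerically. The following is a minimal sketch, not part of the original text: the function name `normal_posterior` and the input values are hypothetical, chosen only for illustration, and both variances are assumed strictly positive (the limiting cases $\tau_0^2 = 0$ and $\sigma^2 = 0$ would divide by zero here).

```python
def normal_posterior(y, sigma2, mu0, tau0_2):
    """Posterior mean and variance of theta for one observation
    y ~ N(theta, sigma2) under the prior theta ~ N(mu0, tau0_2), as in (2.10)."""
    prior_precision = 1.0 / tau0_2
    data_precision = 1.0 / sigma2
    # Posterior precision = prior precision + data precision.
    post_precision = prior_precision + data_precision
    # Posterior mean = precision-weighted average of mu0 and y.
    mu1 = (prior_precision * mu0 + data_precision * y) / post_precision
    return mu1, 1.0 / post_precision


# Hypothetical inputs, for illustration only.
y, sigma2, mu0, tau0_2 = 10.0, 4.0, 0.0, 1.0
mu1, tau1_2 = normal_posterior(y, sigma2, mu0, tau0_2)

# Prior mean adjusted toward y, and data shrunk toward the prior mean.
adjusted = mu0 + (y - mu0) * tau0_2 / (sigma2 + tau0_2)
shrunk = y - (y - mu0) * sigma2 / (sigma2 + tau0_2)

print(mu1, tau1_2, adjusted, shrunk)
```

With these inputs the prior is four times as precise as the single observation, so the posterior mean sits much closer to the prior mean than to $y$, and all three forms agree.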
Posterior predictive distribution


The posterior predictive distribution of a future observation, $\tilde{y}$, $p(\tilde{y}|y)$, can be calculated
directly by integration, using (1.4):


$$\begin{aligned}
p(\tilde{y}|y) &= \int p(\tilde{y}|\theta)\, p(\theta|y)\, d\theta \\
&\propto \int \exp\left(-\frac{1}{2\sigma^2}(\tilde{y}-\theta)^2\right) \exp\left(-\frac{1}{2\tau_1^2}(\theta-\mu_1)^2\right) d\theta.
\end{aligned}$$


The first line above holds because the distribution of the future observation, $\tilde{y}$, given $\theta$,
does not depend on the past data, $y$. We can determine the distribution of $\tilde{y}$ more easily
using the properties of the bivariate normal distribution. The product in the integrand is
the exponential of a quadratic function of $(\tilde{y}, \theta)$; hence $\tilde{y}$ and $\theta$ have a joint normal posterior
distribution, and so the marginal posterior distribution of $\tilde{y}$ is normal.

We can determine the mean and variance of the posterior predictive distribution using
the knowledge from the posterior distribution that $\mathrm{E}(\tilde{y}|\theta) = \theta$ and $\mathrm{var}(\tilde{y}|\theta) = \sigma^2$, along
with identities (2.7) and (2.8):


$$\mathrm{E}(\tilde{y}|y) = \mathrm{E}(\mathrm{E}(\tilde{y}|\theta, y)|y) = \mathrm{E}(\theta|y) = \mu_1,$$

and

$$\begin{aligned}
\mathrm{var}(\tilde{y}|y) &= \mathrm{E}(\mathrm{var}(\tilde{y}|\theta, y)|y) + \mathrm{var}(\mathrm{E}(\tilde{y}|\theta, y)|y) \\
&= \mathrm{E}(\sigma^2|y) + \mathrm{var}(\theta|y) \\
&= \sigma^2 + \tau_1^2.
\end{aligned}$$


Thus, the posterior predictive distribution of $\tilde{y}$ has mean equal to the posterior mean of $\theta$
and two components of variance: the predictive variance $\sigma^2$ from the model and the variance
$\tau_1^2$ due to posterior uncertainty in $\theta$.
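
The two-component variance can be illustrated by simulation: draw $\theta$ from its posterior, then draw a future observation given that $\theta$. The sketch below is an illustration under assumed inputs, not the authors' code; the function names, the numerical values, and the seed are hypothetical.

```python
import random
import statistics


def predictive_moments(mu1, tau1_2, sigma2):
    """Analytic mean and variance of the posterior predictive distribution."""
    return mu1, sigma2 + tau1_2


def simulate_predictive(mu1, tau1_2, sigma2, n_draws=20000, seed=0):
    """Two-stage simulation: theta from N(mu1, tau1_2), then y_tilde from N(theta, sigma2)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        theta = rng.gauss(mu1, tau1_2 ** 0.5)   # posterior draw of theta
        draws.append(rng.gauss(theta, sigma2 ** 0.5))  # future observation
    return draws


# Hypothetical posterior summaries and data variance.
mu1, tau1_2, sigma2 = 2.0, 0.8, 4.0
mean, var = predictive_moments(mu1, tau1_2, sigma2)
draws = simulate_predictive(mu1, tau1_2, sigma2)

print("analytic:  ", mean, var)
print("simulated: ", statistics.fmean(draws), statistics.variance(draws))
```

The simulated mean and variance should be close to the analytic values, up to Monte Carlo error.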


Normal model with multiple observations 


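This development of the normal model with a single observation can be easily extended
to the more realistic situation where a sample of independent and identically distributed
observations $y = (y_1, \ldots, y_n)$ is available. Proceeding formally, the posterior density is


$$\begin{aligned}
p(\theta|y) &\propto p(\theta)\, p(y|\theta) \\
&= p(\theta) \prod_{i=1}^{n} p(y_i|\theta) \\
&\propto \exp\left(-\frac{1}{2\tau_0^2}(\theta - \mu_0)^2\right) \prod_{i=1}^{n} \exp\left(-\frac{1}{2\sigma^2}(y_i - \theta)^2\right) \\
&\propto \exp\left(-\frac{1}{2}\left(\frac{1}{\tau_0^2}(\theta - \mu_0)^2 + \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \theta)^2\right)\right).
\end{aligned}$$


Algebraic simplification of this expression (along similar lines to those used in the single
observation case, as explicated in Exercise 2.14(b)) shows that the posterior distribution
depends on $y$ only through the sample mean, $\bar{y} = \frac{1}{n}\sum_{i} y_i$; that is, $\bar{y}$ is a sufficient statistic
in this model. In fact, since $\bar{y}|\theta, \sigma^2 \sim \mathrm{N}(\theta, \sigma^2/n)$, the results derived for the single normal
observation apply immediately (treating $\bar{y}$ as the single observation) to give


$$p(\theta|y_1, \ldots, y_n) = p(\theta|\bar{y}) = \mathrm{N}(\theta|\mu_n, \tau_n^2), \qquad (2.11)$$

where

$$\mu_n = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\bar{y}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}} \quad\text{and}\quad \frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}. \qquad (2.12)$$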


Incidentally, the same result is obtained by adding information for the data $y_1, y_2, \ldots, y_n$
one point at a time, using the posterior distribution at each step as the prior distribution
for the next (see Exercise 2.14(c)).

In the expressions for the posterior mean and variance, the prior precision, $1/\tau_0^2$, and
the data precision, $n/\sigma^2$, play equivalent roles, so if $n$ is large, the posterior distribution
is largely determined by $\sigma^2$ and the sample value $\bar{y}$. For example, if $\tau_0^2 = \sigma^2$, then the
prior distribution has the same weight as one extra observation with the value $\mu_0$. More
specifically, as $\tau_0 \to \infty$ with $n$ fixed, or as $n \to \infty$ with $\tau_0^2$ fixed, we have:


$$p(\theta|y) \approx \mathrm{N}(\theta|\bar{y}, \sigma^2/n), \qquad (2.13)$$


which is, in practice, a good approximation whenever prior beliefs are relatively diffuse over
the range of $\theta$ where the likelihood is substantial.
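
The equivalence between the batch result (2.12) and one-point-at-a-time updating can be seen in a few lines. This is a minimal sketch with hypothetical function names and made-up data, assuming a known data variance `sigma2`.

```python
def normal_posterior_batch(ys, sigma2, mu0, tau0_2):
    """Posterior (mu_n, tau_n^2) from (2.12), using the sample mean as the single observation."""
    n = len(ys)
    ybar = sum(ys) / n
    post_precision = 1.0 / tau0_2 + n / sigma2
    mu_n = (mu0 / tau0_2 + n * ybar / sigma2) / post_precision
    return mu_n, 1.0 / post_precision


def normal_posterior_sequential(ys, sigma2, mu0, tau0_2):
    """Same posterior, adding one data point at a time: each posterior becomes the next prior."""
    mu, tau2 = mu0, tau0_2
    for y in ys:
        precision = 1.0 / tau2 + 1.0 / sigma2
        mu = (mu / tau2 + y / sigma2) / precision
        tau2 = 1.0 / precision
    return mu, tau2


ys = [9.2, 10.1, 11.4, 8.7]  # hypothetical data
batch = normal_posterior_batch(ys, sigma2=4.0, mu0=0.0, tau0_2=1.0)
sequential = normal_posterior_sequential(ys, sigma2=4.0, mu0=0.0, tau0_2=1.0)
print(batch, sequential)
```

In this example $\tau_0^2 = \sigma^2/4$, so the prior carries the weight of four observations at $\mu_0 = 0$, and the posterior mean falls halfway between $\mu_0$ and $\bar{y}$.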


2.6 Other standard single-parameter models 


Recall that, in general, the posterior density, $p(\theta|y)$, has no closed-form expression; the
normalizing constant, $p(y)$, is often especially difficult to compute due to the integral (1.3).
Much formal Bayesian analysis concentrates on situations where closed forms are available; 
such models are sometimes unrealistic, but their analysis often provides a useful starting 
point when it comes to constructing more realistic models. 

The standard distributions—binomial, normal, Poisson, and exponential—have natural 
derivations from simple probability models. As we have already discussed, the binomial 
distribution is motivated from counting exchangeable outcomes, and the normal distribu- 
tion applies to a random variable that is the sum of many exchangeable or independent 
terms. We will also have occasion to apply the normal distribution to the logarithm of all- 
positive data, which would naturally apply to observations that are modeled as the product 
of many independent multiplicative factors. The Poisson and exponential distributions arise 
as the number of counts and the waiting times, respectively, for events modeled as occur- 
ring exchangeably in all time intervals; that is, independently in time, with a constant rate 
of occurrence. We will generally construct realistic probability models for more compli- 
cated outcomes by combinations of these basic distributions. For example, in Section 22.2, 
we model the reaction times of schizophrenic patients in a psychological experiment as a 
binomial mixture of normal distributions on the logarithmic scale. 

Each of these standard models has an associated family of conjugate prior distributions, 
which we discuss in turn. 


Normal distribution with known mean but unknown variance 


The normal model with known mean $\theta$ and unknown variance is an important example,
not necessarily for its direct applied value, but as a building block for more complicated,
useful models, most immediately the normal distribution with unknown mean and variance,
which we cover in Section 3.2. In addition, the normal distribution with known mean but
unknown variance provides an introductory example of the estimation of a scale parameter.

For $p(y|\theta, \sigma^2) = \mathrm{N}(y|\theta, \sigma^2)$, with $\theta$ known and $\sigma^2$ unknown, the likelihood for a vector
$y$ of $n$ independent and identically distributed observations is


$$\begin{aligned}
p(y|\sigma^2) &\propto \sigma^{-n} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \theta)^2\right) \\
&= (\sigma^2)^{-n/2} \exp\left(-\frac{n}{2\sigma^2}v\right).
\end{aligned}$$

The sufficient statistic is

$$v = \frac{1}{n}\sum_{i=1}^{n}(y_i - \theta)^2.$$

The corresponding conjugate prior density is the inverse-gamma,


$$p(\sigma^2) \propto (\sigma^2)^{-(\alpha+1)} e^{-\beta/\sigma^2},$$

which has hyperparameters $(\alpha, \beta)$. A convenient parameterization is as a scaled inverse-$\chi^2$
distribution with scale $\sigma_0^2$ and $\nu_0$ degrees of freedom (see Appendix A); that is, the prior
distribution of $\sigma^2$ is taken to be the distribution of $\sigma_0^2\nu_0/X$, where $X$ is a $\chi^2_{\nu_0}$ random
variable. We use the convenient but nonstandard notation, $\sigma^2 \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2)$.

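The resulting posterior density for $\sigma^2$ is


$$\begin{aligned}
p(\sigma^2|y) &\propto p(\sigma^2)\, p(y|\sigma^2) \\
&\propto \left(\frac{\sigma_0^2}{\sigma^2}\right)^{\nu_0/2+1} \exp\left(-\frac{\nu_0\sigma_0^2}{2\sigma^2}\right) \cdot (\sigma^2)^{-n/2} \exp\left(-\frac{n}{2}\frac{v}{\sigma^2}\right) \\
&\propto (\sigma^2)^{-((n+\nu_0)/2+1)} \exp\left(-\frac{1}{2\sigma^2}(\nu_0\sigma_0^2 + nv)\right).
\end{aligned}$$


Thus,

$$\sigma^2|y \sim \text{Inv-}\chi^2\left(\nu_0 + n,\ \frac{\nu_0\sigma_0^2 + nv}{\nu_0 + n}\right),$$


which is a scaled inverse-$\chi^2$ distribution with scale equal to the degrees-of-freedom-weighted
average of the prior and data scales and degrees of freedom equal to the sum of the prior
and data degrees of freedom. The prior distribution can be thought of as providing the
information equivalent to $\nu_0$ observations with average squared deviation $\sigma_0^2$.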
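
As a small illustration of this update, the sketch below computes the posterior degrees of freedom and scale from data with a known mean. The function name, data, and prior settings are hypothetical and serve only to show the arithmetic.

```python
def inv_chi2_posterior(ys, theta, nu0, sigma0_2):
    """Posterior (degrees of freedom, scale) of sigma^2 under a scaled
    inverse-chi^2(nu0, sigma0_2) prior, for normal data with known mean theta."""
    n = len(ys)
    # Sufficient statistic: average squared deviation from the known mean.
    v = sum((y - theta) ** 2 for y in ys) / n
    nu_n = nu0 + n
    # Degrees-of-freedom-weighted average of prior scale and data scale.
    scale_n = (nu0 * sigma0_2 + n * v) / nu_n
    return nu_n, scale_n


# Hypothetical data with known mean 0 and a prior worth 2 observations.
nu_n, scale_n = inv_chi2_posterior([1.0, -1.0, 2.0, -2.0], theta=0.0, nu0=2, sigma0_2=1.0)
print(nu_n, scale_n)
```

Here $v = 2.5$ from four observations is pulled toward the prior scale of 1.0, which counts as two observations, giving a posterior scale of 2.0 on 6 degrees of freedom.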


Poisson model 


The Poisson distribution arises naturally in the study of data taking the form of counts; 
for instance, a major area of application is epidemiology, where the incidence of diseases is 
studied. 

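If a data point $y$ follows the Poisson distribution with rate $\theta$, then the probability
distribution of a single observation $y$ is


$$p(y|\theta) = \frac{\theta^y e^{-\theta}}{y!}, \quad\text{for } y = 0, 1, 2, \ldots,$$


and for a vector $y = (y_1, \ldots, y_n)$ of independent and identically distributed observations,
the likelihood is


$$\begin{aligned}
p(y|\theta) &= \prod_{i=1}^{n} \frac{1}{y_i!}\theta^{y_i} e^{-\theta} \\
&\propto \theta^{t(y)} e^{-n\theta},
\end{aligned}$$


where $t(y) = \sum_{i=1}^{n} y_i$ is the sufficient statistic. We can rewrite the likelihood in exponential
family form as


$$p(y|\theta) \propto e^{-n\theta} e^{t(y)\log\theta},$$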


revealing that the natural parameter is $\phi(\theta) = \log\theta$, and the natural conjugate prior
distribution is

$$p(\theta) \propto (e^{-\theta})^{\eta} e^{\nu\log\theta},$$

indexed by hyperparameters $(\eta, \nu)$. To put this argument another way, the likelihood is of
the form $\theta^a e^{-b\theta}$, and so the conjugate prior density must be of the form $p(\theta) \propto \theta^A e^{-B\theta}$. In
a more conventional parameterization,


$$p(\theta) \propto e^{-\beta\theta}\theta^{\alpha-1},$$


which is a gamma density with parameters $\alpha$ and $\beta$, Gamma($\alpha, \beta$); see Appendix A.
Comparing $p(y|\theta)$ and $p(\theta)$ reveals that the prior density is, in some sense, equivalent to a total
count of $\alpha - 1$ in $\beta$ prior observations. With this conjugate prior distribution, the posterior
distribution is

$$\theta|y \sim \text{Gamma}(\alpha + n\bar{y}, \beta + n).$$

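The negative binomial distribution. With conjugate families, the known form of the prior
and posterior densities can be used to find the marginal distribution, $p(y)$, using the formula


$$p(y) = \frac{p(y|\theta)\, p(\theta)}{p(\theta|y)}.$$


For instance, the Poisson model for a single observation, $y$, has prior predictive distribution


$$\begin{aligned}
p(y) &= \frac{\text{Poisson}(y|\theta)\,\text{Gamma}(\theta|\alpha, \beta)}{\text{Gamma}(\theta|\alpha + y, 1 + \beta)} \\
&= \frac{\Gamma(\alpha + y)\,\beta^{\alpha}}{\Gamma(\alpha)\, y!\, (1 + \beta)^{\alpha + y}},
\end{aligned}$$


which reduces to


$$p(y) = \binom{\alpha + y - 1}{y} \left(\frac{\beta}{\beta + 1}\right)^{\alpha} \left(\frac{1}{\beta + 1}\right)^{y},$$


which is known as the negative binomial density:


$$y \sim \text{Neg-bin}(\alpha, \beta).$$


The above derivation shows that the negative binomial distribution is a mixture of Poisson
distributions with rates, $\theta$, that follow the gamma distribution:


$$\text{Neg-bin}(y|\alpha, \beta) = \int \text{Poisson}(y|\theta)\,\text{Gamma}(\theta|\alpha, \beta)\, d\theta.$$


We return to the negative binomial distribution in Section 17.2 as a robust alternative to
the Poisson distribution.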
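
The ratio formula for $p(y)$ holds for every value of $\theta$, which gives a simple numerical check of the negative binomial form. The sketch below uses only the standard library; the function names and the test values of $\alpha$, $\beta$, $y$, and $\theta$ are hypothetical, and all densities are computed on the log scale.

```python
import math


def log_poisson(y, theta):
    """log Poisson(y | theta)."""
    return y * math.log(theta) - theta - math.lgamma(y + 1)


def log_gamma_pdf(theta, alpha, beta):
    """log Gamma(theta | alpha, beta), with beta a rate (inverse scale)."""
    return (alpha * math.log(beta) - math.lgamma(alpha)
            + (alpha - 1) * math.log(theta) - beta * theta)


def log_neg_bin(y, alpha, beta):
    """log Neg-bin(y | alpha, beta) in the closed form given in the text."""
    return (math.lgamma(alpha + y) - math.lgamma(alpha) - math.lgamma(y + 1)
            + alpha * math.log(beta / (beta + 1)) - y * math.log(beta + 1))


def log_marginal_via_ratio(y, theta, alpha, beta):
    """log p(y) = log p(y|theta) + log p(theta) - log p(theta|y), for any theta > 0."""
    return (log_poisson(y, theta) + log_gamma_pdf(theta, alpha, beta)
            - log_gamma_pdf(theta, alpha + y, beta + 1))


alpha, beta, y = 3.0, 5.0, 2  # hypothetical values
for theta in (0.3, 1.0, 4.2):
    print(theta, log_marginal_via_ratio(y, theta, alpha, beta), log_neg_bin(y, alpha, beta))
```

The middle column does not change with $\theta$ and matches the closed form, as the derivation requires.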


Poisson model parameterized in terms of rate and exposure 


In many applications, it is convenient to extend the Poisson model for data points $y_1, \ldots, y_n$
to the form

$$y_i \sim \text{Poisson}(x_i\theta), \qquad (2.14)$$


where the values $x_i$ are known positive values of an explanatory variable, $x$, and $\theta$ is the
unknown parameter of interest. In epidemiology, the parameter $\theta$ is often called the rate,
and $x_i$ is called the exposure of the $i$th unit. This model is not exchangeable in the $y_i$’s but
is exchangeable in the pairs $(x, y)_i$. The likelihood for $\theta$ in the extended Poisson model is


$$p(y|\theta) \propto \theta^{\left(\sum_{i=1}^{n} y_i\right)} e^{-\left(\sum_{i=1}^{n} x_i\right)\theta}$$

(ignoring factors that do not depend on $\theta$), and so the gamma distribution for $\theta$ is conjugate.
With prior distribution

$$\theta \sim \text{Gamma}(\alpha, \beta),$$


the resulting posterior distribution is


$$\theta|y \sim \text{Gamma}\left(\alpha + \sum_{i=1}^{n} y_i,\ \beta + \sum_{i=1}^{n} x_i\right). \qquad (2.15)$$

Estimating a rate from Poisson data: an idealized example 

Suppose that causes of death are reviewed in detail for a city in the United States for a 
single year. It is found that 3 persons, out of a population of 200,000, died of asthma, 
giving a crude estimated asthma mortality rate in the city of 1.5 cases per 100,000 
persons per year. A Poisson sampling model is often used for epidemiological data of 
this form. The Poisson model derives from an assumption of exchangeability among 
all small intervals of exposure. Under the Poisson model, the sampling distribution 
of y, the number of deaths in a city of 200,000 in one year, may be expressed as 
Poisson($2.0\theta$), where $\theta$ represents the true underlying long-term asthma mortality rate
in our city (measured in cases per 100,000 persons per year). In the above notation,
$y = 3$ is a single observation with exposure $x = 2.0$ (since $\theta$ is defined in units of
100,000 people) and unknown rate $\theta$. We can use knowledge about asthma mortality
rates around the world to construct a prior distribution for $\theta$ and then combine the
datum y = 3 with that prior distribution to obtain a posterior distribution. 


Setting up a prior distribution. What is a sensible prior distribution for $\theta$? Reviews
of asthma mortality rates around the world suggest that mortality rates above 1.5 
per 100,000 people are rare in Western countries, with typical asthma mortality rates 
around 0.6 per 100,000. Trial-and-error exploration of the properties of the gamma dis- 
tribution, the conjugate prior family for this problem, reveals that a Gamma(3.0, 5.0) 
density provides a plausible prior density for the asthma mortality rate in this example 
if we assume exchangeability between this city and other cities and this year and other 
years. The mean of this prior distribution is 0.6 (with a mode of 0.4), and 97.5% of 
the mass of the density lies below 1.44. In practice, specifying a prior mean sets the 
ratio of the two gamma parameters, and then the shape parameter can be altered by 
trial and error to match the prior knowledge about the tail of the distribution. 


Posterior distribution. The result in (2.15) shows that the posterior distribution
of $\theta$ for a Gamma($\alpha, \beta$) prior distribution is Gamma($\alpha + y, \beta + x$) in this case.
With the prior distribution and data described, the posterior distribution for $\theta$ is
Gamma(6.0, 7.0), which has mean 0.86; substantial shrinkage has occurred toward
the prior distribution. A histogram of 1000 draws from the posterior distribution for
$\theta$ is shown as Figure 2.5a. For example, the posterior probability that the long-term
death rate from asthma in our city is more than 1.0 per 100,000 per year, computed
from the gamma posterior density, is 0.30.


Posterior distribution with additional data. To consider the effect of additional data, 
suppose that ten years of data are obtained for the city in our example, instead of just 
one, and it is found that the mortality rate of 1.5 per 100,000 is maintained; we find 
y = 30 deaths over 10 years. Assuming the population is constant at 200,000, and 
assuming the outcomes in the ten years are independent with constant long-term rate 
$\theta$, the posterior distribution of $\theta$ is then Gamma(33.0, 25.0); Figure 2.5b displays 1000
draws from this distribution. The posterior distribution is much more concentrated
than before, and it still lies between the prior distribution and the data. After ten
years of data, the posterior mean of $\theta$ is 1.32, and the posterior probability that $\theta$
exceeds 1.0 is 0.93.
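
The posterior summaries quoted in this example can be reproduced with the standard library alone. The sketch below is an illustration, not the authors' computation: the function names are hypothetical, and the tail probability relies on the fact that a gamma distribution with integer shape $a$ and rate $b$ satisfies $\Pr(\theta > x) = \Pr(N \le a - 1)$ for $N \sim \text{Poisson}(bx)$, which applies here because the posterior shapes 6 and 33 are integers.

```python
import math


def poisson_gamma_update(alpha, beta, y, x):
    """Posterior Gamma parameters from (2.15) for total count y and total exposure x."""
    return alpha + y, beta + x


def gamma_sf_integer_shape(x, shape, rate):
    """P(theta > x) for theta ~ Gamma(shape, rate) with integer shape,
    computed as a Poisson(rate * x) cumulative probability up to shape - 1."""
    lam = rate * x
    return sum(math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
               for k in range(shape))


# Prior Gamma(3, 5); one year: y = 3 deaths, exposure 2.0 (units of 100,000 person-years).
a1, b1 = poisson_gamma_update(3, 5.0, y=3, x=2.0)
# Ten years: y = 30 deaths, exposure 20.0.
a10, b10 = poisson_gamma_update(3, 5.0, y=30, x=20.0)

print(a1, b1, a1 / b1, gamma_sf_integer_shape(1.0, a1, b1))
print(a10, b10, a10 / b10, gamma_sf_integer_shape(1.0, a10, b10))
```

The posterior means are $6/7$ and $33/25$, and the two tail probabilities round to the values 0.30 and 0.93 reported above.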

Figure 2.5 Posterior density for $\theta$, the asthma mortality rate in cases per 100,000 persons per year,
with a Gamma(3.0, 5.0) prior distribution: (a) given $y = 3$ deaths out of 200,000 persons; (b) given
$y = 30$ deaths in 10 years for a constant population of 200,000. The histograms appear jagged
because they are constructed from only 1000 random draws from the posterior distribution in each
case.


Exponential model 


The exponential distribution is commonly used to model ‘waiting times’ and other continuous,
positive, real-valued random variables, often measured on a time scale. The sampling
distribution of an outcome $y$, given parameter $\theta$, is


$$p(y|\theta) = \theta\exp(-y\theta), \quad\text{for } y > 0,$$


and $\theta = 1/\mathrm{E}(y|\theta)$ is called the ‘rate.’ Mathematically, the exponential is a special case of the
gamma distribution with the parameters $(\alpha, \beta) = (1, \theta)$. In this case, however, it is being
used as a sampling distribution for an outcome $y$, not a prior distribution for a parameter
$\theta$, as in the Poisson example.

The exponential distribution has a ‘memoryless’ property that makes it a natural model
for survival or lifetime data; the probability that an object survives an additional length of
time $t$ is independent of the time elapsed to this point: $\Pr(y > t+s \mid y > s, \theta) = \Pr(y > t \mid \theta)$ for
any $s, t$. The conjugate prior distribution for the exponential parameter $\theta$, as for the Poisson
mean, is Gamma($\theta|\alpha, \beta$) with corresponding posterior distribution Gamma($\theta|\alpha+1, \beta+y$).
The sampling distribution of $n$ independent exponential observations, $y = (y_1, \ldots, y_n)$, with
constant rate $\theta$ is


$$p(y|\theta) = \theta^n\exp(-n\bar{y}\theta), \quad\text{for } y \geq 0,$$


which when viewed as the likelihood of $\theta$, for fixed $y$, is proportional to a Gamma($n+1, n\bar{y}$)
density. Thus the Gamma($\alpha, \beta$) prior distribution for $\theta$ can be viewed as $\alpha - 1$ exponential
observations with total waiting time $\beta$ (see Exercise 2.19).
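
Multiplying the Gamma($\alpha, \beta$) prior by the $n$-observation likelihood $\theta^n\exp(-n\bar{y}\theta)$ gives a Gamma($\alpha + n, \beta + n\bar{y}$) posterior, which reduces to the single-observation result above when $n = 1$. A minimal sketch of that update follows; the function name and the waiting times are hypothetical.

```python
def exponential_gamma_update(alpha, beta, ys):
    """Posterior Gamma parameters for the rate of an exponential model:
    the shape grows by the number of observations, the rate by the total waiting time."""
    return alpha + len(ys), beta + sum(ys)


# Hypothetical waiting times and prior.
alpha_n, beta_n = exponential_gamma_update(2.0, 1.0, [0.5, 1.5, 2.0])
print(alpha_n, beta_n, alpha_n / beta_n)  # posterior shape, rate, and mean of theta
```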


2.7 Example: informative prior distribution for cancer rates 


At the end of Section 2.4, we considered the effect of the prior distribution on inference 
given a fixed quantity of data. Here, in contrast, we consider a large set of inferences, each 
based on different data but with a common prior distribution. In addition to illustrating 
the role of the prior distribution, this example introduces hierarchical modeling, to which 
we return in Chapter 5. 
Highest kidney cancer death rates


Figure 2.6 The counties of the United States with the highest 10% age-standardized death rates for 
cancer of kidney/ureter for U.S. white males, 1980-1989. Why are most of the shaded counties in 
the middle of the country? See Section 2.7 for discussion. 


Lowest kidney cancer death rates 


Figure 2.7 The counties of the United States with the lowest 10% age-standardized death rates for 
cancer of kidney/ureter for U.S. white males, 1980-1989. Surprisingly, the pattern is somewhat 
similar to the map of the highest rates, shown in Figure 2.6. 


A puzzling pattern in a map 


Figure 2.6 shows the counties in the United States with the highest kidney cancer death 
rates during the 1980s.¹ The most noticeable pattern in the map is that many of the
counties in the Great Plains in the middle of the country, but relatively few counties near 
the coasts, are shaded. 

When shown the map, people come up with many theories to explain the dispropor- 
tionate shading in the Great Plains: perhaps the air or the water is polluted, or the people 
tend not to seek medical care so the cancers get detected too late to treat, or perhaps their 
diet is unhealthy ... These conjectures may all be true but they are not actually needed 
to explain the patterns in Figure 2.6. To see this, look at Figure 2.7, which plots the 10% 
of counties with the lowest kidney cancer death rates. These are also mostly in the middle 
of the country. So now we need to explain why these areas have the lowest, as well as the 
highest, rates. 


¹ The rates are age-adjusted and restricted to white males, issues which need not concern us here.

The issue is sample size. Consider a county of population 1000. Kidney cancer is a
rare disease, and, in any ten-year period, a county of 1000 will probably have zero kidney 
cancer deaths, so that it will be tied for the lowest rate in the country and will be shaded 
in Figure 2.7. However, there is a chance the county will have one kidney cancer death 
during the decade. If so, it will have a rate of 1 per 10,000 per year, which is high enough 
to put it in the top 10% so that it will be shaded in Figure 2.6. The Great Plains has many 
low-population counties, and so it is overrepresented in both maps. There is no evidence 
from these maps that cancer rates are particularly high there. 


Bayesian inference for the cancer death rates 


The misleading patterns in the maps of raw rates suggest that a model-based approach to
estimating the true underlying rates might be helpful. In particular, it is natural to estimate
the underlying cancer death rate in each county $j$ using the model


$$y_j \sim \text{Poisson}(10 n_j \theta_j), \qquad (2.16)$$


where $y_j$ is the number of kidney cancer deaths in county $j$ from 1980-1989, $n_j$ is the
population of the county, and $\theta_j$ is the underlying rate in units of deaths per person per
year. In this notation, the maps in Figures 2.6 and 2.7 are plotting the raw rates, $\frac{y_j}{10 n_j}$.
(Here we are ignoring the age-standardization, although a generalization of the model to
allow for this would be possible.)

This model differs from (2.14) in that $\theta_j$ varies between counties, so that (2.16) is a
separate model for each of the counties in the U.S. We use the subscript $j$ (rather than $i$)
in (2.16) to emphasize that these are separate parameters, each being estimated from its
own data. Were we performing inference for just one of the counties, we would simply write
$y \sim \text{Poisson}(10 n \theta)$.

To perform Bayesian inference, we need a prior distribution for the unknown rate $\theta_j$.
For convenience we use a gamma distribution, which is conjugate to the Poisson. As we
shall discuss later, a gamma distribution with parameters $\alpha = 20$ and $\beta = 430{,}000$ is a
reasonable prior distribution for underlying kidney cancer death rates in the counties of
the U.S. during this period. This prior distribution has a mean of $\frac{\alpha}{\beta} = 4.65 \times 10^{-5}$ and
standard deviation $\frac{\sqrt{\alpha}}{\beta} = 1.04 \times 10^{-5}$.

The posterior distribution of $\theta_j$ is then


$$\theta_j|y_j \sim \text{Gamma}(20 + y_j,\ 430{,}000 + 10 n_j),$$


which has mean and variance


$$\mathrm{E}(\theta_j|y_j) = \frac{20 + y_j}{430{,}000 + 10 n_j},$$
$$\mathrm{var}(\theta_j|y_j) = \frac{20 + y_j}{(430{,}000 + 10 n_j)^2}.$$

The posterior mean can be viewed as a weighted average of the raw rate, $\frac{y_j}{10 n_j}$, and the
prior mean, $\frac{\alpha}{\beta} = 4.65 \times 10^{-5}$. (For a similar calculation, see Exercise 2.5.)


Relative importance of the local data and the prior distribution 


Figure 2.8 (a) Kidney cancer death rates $y_j/(10 n_j)$ vs. population size $n_j$. (b) Replotted on the
scale of $\log_{10}$ population to see the data more clearly. The patterns come from the discreteness of
the data ($y_j = 0, 1, 2, \ldots$).


Inference for a small county. The relative weighting of prior information and data depends
on the population size $n_j$. For example, consider a small county with $n_j = 1000$:


• For this county, if $y_j = 0$, then the raw death rate is 0 but the posterior mean is
$\frac{20}{440{,}000} = 4.55 \times 10^{-5}$.


• If $y_j = 1$, then the raw death rate is 1 per 1000 per 10 years, or $10^{-4}$ per person-year
(about twice as high as the national mean), but the posterior mean is only $\frac{21}{440{,}000} =
4.77 \times 10^{-5}$.


• If $y_j = 2$, then the raw death rate is an extremely high $2 \times 10^{-4}$ per person-year, but
the posterior mean is still only $\frac{22}{440{,}000} = 5.00 \times 10^{-5}$.


With such a small population size, the data are dominated by the prior distribution.
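
The small-county posterior means, and the weighted-average reading of the posterior mean, can be reproduced directly. In the sketch below the function names are hypothetical; the weight on the raw rate is taken to be $10 n_j/(\beta + 10 n_j)$, which follows from rewriting $(\alpha + y_j)/(\beta + 10 n_j)$ as a combination of $y_j/(10 n_j)$ and $\alpha/\beta$.

```python
ALPHA, BETA = 20, 430_000  # Gamma prior parameters used in the text
YEARS = 10                 # length of the observation period


def posterior_mean_rate(y, n):
    """Posterior mean of theta_j for y deaths in a county of population n."""
    return (ALPHA + y) / (BETA + YEARS * n)


def weighted_average_form(y, n):
    """Same quantity, written as weight * raw rate + (1 - weight) * prior mean."""
    exposure = YEARS * n
    weight = exposure / (BETA + exposure)  # weight given to the local data
    return weight * (y / exposure) + (1 - weight) * (ALPHA / BETA)


for y in (0, 1, 2):
    print(y, posterior_mean_rate(y, 1000), weighted_average_form(y, 1000))

# Weight on the local data for a county of 1000 people.
print(YEARS * 1000 / (BETA + YEARS * 1000))
```

For $n_j = 1000$ the data receive a weight of only about 0.02, which is why even two deaths move the posterior mean so little.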

But how likely, a priori, is it that $y_j$ will equal 0, 1, 2, and so forth, for this county with
$n_j = 1000$? This is determined by the predictive distribution, the marginal distribution
of $y_j$, averaging over the prior distribution of $\theta_j$. As discussed in Section 2.6, the Poisson
model with gamma prior distribution has a negative binomial predictive distribution:


$$y_j \sim \text{Neg-bin}\left(\alpha, \frac{\beta}{10 n_j}\right).$$


It is perhaps even simpler to simulate directly the predictive distribution of $y_j$ as follows:
(1) draw 500 (say) values of $\theta_j$ from the Gamma(20, 430,000) distribution; (2) for each of
these, draw one value $y_j$ from the Poisson distribution with parameter $10{,}000\,\theta_j$. Of 500
simulations of $y_j$ produced in this way, 319 were 0’s, 141 were 1’s, 33 were 2’s, and 5 were
3’s.
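
As a cross-check on the simulation, the negative binomial probabilities can be evaluated exactly and scaled to 500 draws. This is a sketch using the negative binomial density from Section 2.6; the function name is hypothetical, and the expected counts will differ from any particular simulation by sampling variability.

```python
import math


def neg_bin_pmf(y, alpha, beta):
    """Neg-bin(y | alpha, beta) as defined in Section 2.6, computed on the log scale."""
    log_p = (math.lgamma(alpha + y) - math.lgamma(alpha) - math.lgamma(y + 1)
             + alpha * math.log(beta / (beta + 1)) - y * math.log(beta + 1))
    return math.exp(log_p)


alpha, beta, n_j = 20, 430_000, 1000
beta_j = beta / (10 * n_j)  # second negative binomial parameter: 43 for this county

# Expected number of 0's, 1's, 2's, and 3's among 500 predictive draws.
expected = [500 * neg_bin_pmf(y, alpha, beta_j) for y in range(4)]
print([round(e, 1) for e in expected])
```

The exact expected counts are close to the simulated counts reported above, with zero deaths by far the most likely outcome.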

Inference for a large county. Now consider a large county with $n_j = 1$ million. How
many cancer deaths $y_j$ might we expect to see in a ten-year period? Again we can use
the Gamma(20, 430,000) and Poisson($10^7\theta_j$) distributions to simulate 500 values $y_j$ from
the predictive distribution. Doing this we found a median of 473 and a 50% interval of
[393, 545]. The raw death rate in such a county is then as likely as not to fall between
$3.93 \times 10^{-5}$ and $5.45 \times 10^{-5}$.

What about the Bayesianly estimated or ‘Bayes-adjusted’ death rate? For example, if
$y_j$ takes on the low value of 393, then the raw death rate is $3.93 \times 10^{-5}$ and the posterior
mean of $\theta_j$ is $\frac{20 + 393}{10{,}430{,}000} = 3.96 \times 10^{-5}$, and if $y_j = 545$, then the raw rate is $5.45 \times 10^{-5}$
and the posterior mean is $5.41 \times 10^{-5}$. In this large county, the data dominate the prior
distribution.


Comparing counties of different sizes. In the Poisson model (2.16), the variance of $\frac{y_j}{10 n_j}$
is inversely proportional to the exposure parameter $n_j$, which can thus be considered a
‘sample size’ for county $j$. Figure 2.8 shows how the raw kidney cancer death rates vary by
population. The extremely high and extremely low rates are all in low-population counties.
By comparison, Figure 2.9a shows that the Bayes-estimated rates are much less variable.
Finally, Figure 2.9b displays 50% interval estimates for a sample of counties (chosen because
it would be hard to display all 3071 in a single plot). The smaller counties supply less
information and thus have wider posterior intervals.


Figure 2.9 (a) Bayes-estimated posterior mean kidney cancer death rates, $\mathrm{E}(\theta_j|y_j) = \frac{20 + y_j}{430{,}000 + 10 n_j}$,
vs. logarithm of population size $n_j$, the 3071 counties in the U.S. (b) Posterior medians and 50%
intervals for $\theta_j$ for a sample of 100 counties $j$. The scales on the y-axes differ from the plots in
Figure 2.8b.


Constructing a prior distribution 


We now step back and discuss where we got the Gamma(20, 430,000) prior distribution for
the underlying rates. As we discussed when introducing the model, we picked the gamma
distribution for mathematical convenience. We now explain how the two parameters $\alpha, \beta$
can be estimated from data to match the distribution of the observed cancer death rates
$\frac{y_j}{10 n_j}$. It might seem inappropriate to use the data to set the prior distribution, but we
view this as a useful approximation to our preferred approach of hierarchical modeling
(introduced in Chapter 5), in which distributional parameters such as $\alpha, \beta$ in this example
are treated as unknowns to be estimated.

Under the model, the observed count $y_j$ for any county $j$ comes from the predictive
distribution, $p(y_j) = \int p(y_j|\theta_j)\, p(\theta_j)\, d\theta_j$, which in this case is Neg-bin($\alpha, \frac{\beta}{10 n_j}$). From Appendix
A, we can find the mean and variance of this distribution:


$$\begin{aligned}
\mathrm{E}(y_j) &= 10 n_j \frac{\alpha}{\beta}, \\
\mathrm{var}(y_j) &= 10 n_j \frac{\alpha}{\beta} + (10 n_j)^2 \frac{\alpha}{\beta^2}.
\end{aligned} \qquad (2.17)$$


These can also be derived directly using the mean and variance formulas (1.8) and (1.9);
see Exercise 2.6.

Matching the observed mean and variance to their expectations and solving for $\alpha$ and $\beta$
yields the parameters of the prior distribution. The actual computation is more complicated
because we must deal with the age adjustment and it also is more efficient to work with the
mean and variance of the rates $\frac{y_j}{10 n_j}$:


$$\begin{aligned}
\mathrm{E}\left(\frac{y_j}{10 n_j}\right) &= \frac{\alpha}{\beta}, \\
\mathrm{var}\left(\frac{y_j}{10 n_j}\right) &= \frac{1}{10 n_j}\frac{\alpha}{\beta} + \frac{\alpha}{\beta^2}.
\end{aligned} \qquad (2.18)$$


After dealing with the age adjustments, we equate the observed and theoretical moments,
setting the mean of the values of $\frac{y_j}{10 n_j}$ to $\frac{\alpha}{\beta}$ and setting the variance of the values of $\frac{y_j}{10 n_j}$
to $\mathrm{E}\left(\frac{1}{10 n_j}\right)\frac{\alpha}{\beta} + \frac{\alpha}{\beta^2}$, using the sample average of the values $\frac{1}{10 n_j}$ in place of $\mathrm{E}\left(\frac{1}{10 n_j}\right)$ in that
last expression.


Figure 2.10 Empirical distribution of the age-adjusted kidney cancer death rates, $\frac{y_j}{10 n_j}$, for the 3071
counties in the U.S., along with the Gamma(20, 430,000) prior distribution for the underlying cancer
rates $\theta_j$.

Figure 2.10 shows the empirical distribution of the raw cancer rates, along with the 
estimated Gamma(20, 430,000) prior distribution for the underlying cancer rates $\theta_j$. The
distribution of the raw rates is much broader, which makes sense since they include the 
Poisson variability as well as the variation between counties. 

Our prior distribution is reasonable in this example, but this method of constructing 
it—by matching moments—is somewhat sloppy and can be difficult to apply in general. In 
Chapter 5, we discuss how to estimate this and other prior distributions in a more direct 
Bayesian manner, using hierarchical models. 
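
The moment-matching step can be sketched as follows, using (2.18) and ignoring the age adjustment that complicates the actual computation. The function name and inputs are hypothetical; the inputs are the sample mean and variance of the raw rates $y_j/(10 n_j)$ and the sample average of $1/(10 n_j)$. The example feeds in moments that are exactly consistent with $\alpha = 20$, $\beta = 430{,}000$ and a common exposure, purely to confirm that the algebra inverts correctly.

```python
def match_gamma_moments(mean_rate, var_rate, mean_inv_exposure):
    """Solve mean = alpha/beta and
    var = mean_inv_exposure * alpha/beta + alpha/beta**2 for (alpha, beta)."""
    # Subtract the Poisson part of the variance; what remains is alpha / beta**2.
    between_var = var_rate - mean_inv_exposure * mean_rate
    if between_var <= 0:
        raise ValueError("observed variance is no larger than the Poisson variance")
    beta = mean_rate / between_var
    alpha = mean_rate * beta
    return alpha, beta


# Synthetic check: moments implied by alpha = 20, beta = 430,000 when every
# county has exposure 10 * n_j = 100,000 (a made-up value).
true_alpha, true_beta, exposure = 20.0, 430_000.0, 100_000.0
mean_rate = true_alpha / true_beta
var_rate = mean_rate / exposure + true_alpha / true_beta ** 2

alpha_hat, beta_hat = match_gamma_moments(mean_rate, var_rate, 1.0 / exposure)
print(alpha_hat, beta_hat)
```

With real data the subtraction can produce a nonpositive value when the raw rates vary less than Poisson sampling alone would imply, which is one reason the method is difficult to apply in general.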

A more important way this model could be improved is by including information at the 
county level that could predict variation in the cancer rates. This would move the model 
toward a hierarchical Poisson regression of the sort discussed in Chapter 16. 


2.8 Noninformative prior distributions 


When prior distributions have no population basis, they can be difficult to construct, and 
there has long been a desire for prior distributions that can be guaranteed to play a minimal 
role in the posterior distribution. Such distributions are sometimes called ‘reference prior 
distributions,’ and the prior density is described as vague, flat, diffuse or noninformative. 
The rationale for using noninformative prior distributions is often said to be ‘to let the 
data speak for themselves,’ so that inferences are unaffected by information external to the 
current data. 

A related idea is the weakly informative prior distribution, which contains some informa- 
tion—enough to ‘regularize’ the posterior distribution, that is, to keep it roughly within rea- 
sonable bounds—but without attempting to fully capture one’s scientific knowledge about 
the underlying parameter. 


Proper and improper prior distributions 


We return to the problem of estimating the mean $\theta$ of a normal model with known variance
$\sigma^2$, with a $\mathrm{N}(\mu_0, \tau_0^2)$ prior distribution on $\theta$. If the prior precision, $1/\tau_0^2$, is small relative to
the data precision, $n/\sigma^2$, then the posterior distribution is approximately as if $\tau_0^2 = \infty$:


$$p(\theta|y) \approx \mathrm{N}(\theta|\bar{y}, \sigma^2/n).$$

Putting this another way, the posterior distribution is approximately that which would result
from assuming $p(\theta)$ is proportional to a constant for $\theta \in (-\infty, \infty)$. Such a distribution is
not strictly possible, since the integral of the assumed $p(\theta)$ is infinity, which violates the
assumption that probabilities sum to 1. In general, we call a prior density $p(\theta)$ proper if it
does not depend on data and integrates to 1. (If $p(\theta)$ integrates to any positive finite value,
it is called an unnormalized density and can be renormalized, that is, multiplied by a constant,
to integrate to 1.) The prior distribution is improper in this example, but the posterior
distribution is proper, given at least one data point.

As a second example of a noninformative prior distribution, consider the normal model
with known mean but unknown variance, with the conjugate scaled inverse-$\chi^2$ prior
distribution. If the prior degrees of freedom, $\nu_0$, are small relative to the data degrees of freedom,
$n$, then the posterior distribution is approximately as if $\nu_0 = 0$:


$$p(\sigma^2|y) \approx \text{Inv-}\chi^2(\sigma^2|n, v).$$


This limiting form of the posterior distribution can also be derived by defining the prior
density for $\sigma^2$ as $p(\sigma^2) \propto 1/\sigma^2$, which is improper, having an infinite integral over the range
$(0, \infty)$.


Improper prior distributions can lead to proper posterior distributions 


In neither of the above two examples does the prior density combine with the likelihood to
define a proper joint probability model, $p(y, \theta)$. However, we can proceed with the algebra
of Bayesian inference and define an unnormalized posterior density function by


$$p(\theta|y) \propto p(y|\theta)\, p(\theta).$$


In the above examples (but not always!), the posterior density is in fact proper; that is,
$\int p(\theta|y)\, d\theta$ is finite for all $y$. Posterior distributions obtained from improper prior
distributions must be interpreted with great care: one must always check that the posterior
distribution has a finite integral and a sensible form. Their most reasonable interpretation
is as approximations in situations where the likelihood dominates the prior density. We
discuss this aspect of Bayesian analysis more completely in Chapter 4.


Jeffreys’ invariance principle 


One approach that is sometimes used to define noninformative prior distributions was
introduced by Jeffreys, based on considering one-to-one transformations of the parameter:
$\phi = h(\theta)$. By transformation of variables, the prior density $p(\theta)$ is equivalent, in terms of
expressing the same beliefs, to the following prior density on $\phi$:

$$p(\phi) = p(\theta) \left| \frac{d\theta}{d\phi} \right| = p(\theta)\, |h'(\theta)|^{-1}. \qquad (2.19)$$

Jeffreys’ general principle is that any rule for determining the prior density $p(\theta)$ should
yield an equivalent result if applied to the transformed parameter; that is, $p(\phi)$ computed
by determining $p(\theta)$ and applying (2.19) should match the distribution that is obtained by
determining $p(\phi)$ directly using the transformed model, $p(y, \phi) = p(\phi)\, p(y | \phi)$.

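Jeffreys’ principle leads to defining the noninformative prior density as $p(\theta) \propto [J(\theta)]^{1/2}$,
where $J(\theta)$ is the Fisher information for $\theta$:

$$J(\theta) = \mathrm{E}\left[ \left( \frac{d \log p(y|\theta)}{d\theta} \right)^2 \,\Big|\, \theta \right] = -\mathrm{E}\left[ \frac{d^2 \log p(y|\theta)}{d\theta^2} \,\Big|\, \theta \right]. \qquad (2.20)$$

To see that Jeffreys’ prior model is invariant to parameterization, evaluate $J(\phi)$ at
$\theta = h^{-1}(\phi)$:

$$\begin{aligned}
J(\phi) &= -\mathrm{E}\left[ \frac{d^2 \log p(y|\phi)}{d\phi^2} \right] \\
        &= -\mathrm{E}\left[ \frac{d^2 \log p(y|\theta = h^{-1}(\phi))}{d\theta^2} \left| \frac{d\theta}{d\phi} \right|^2 \right] \\
        &= J(\theta) \left| \frac{d\theta}{d\phi} \right|^2;
\end{aligned}$$

thus, $J(\phi)^{1/2} = J(\theta)^{1/2} \left| \frac{d\theta}{d\phi} \right|$, as required. (The chain rule also produces a term
proportional to the first derivative $d \log p(y|\theta)/d\theta$; it drops out of the second line because
the expectation of that derivative is zero.)

Jeffreys’ principle can be extended to multiparameter models, but the results are more
controversial. Simpler approaches based on assuming independent noninformative prior
distributions for the components of the vector parameter $\theta$ can give different results than
are obtained with Jeffreys’ principle. When the number of parameters in a problem is large,
we find it useful to abandon pure noninformative prior distributions in favor of hierarchical
models, as we discuss in Chapter 5.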
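
The invariance property can be checked numerically in a simple case. The sketch below is illustrative and not part of the original text: it assumes a binomial sampling model with the transformation $\phi = \text{logit}(\theta)$, and it evaluates the Fisher information as the expected squared derivative of the log-likelihood, with the derivative approximated by central differences. All function names are hypothetical.

```python
import math

def expit(phi):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-phi))

def binom_pmf(y, n, theta):
    return math.comb(n, y) * theta ** y * (1.0 - theta) ** (n - y)

def log_lik_theta(y, n, theta):
    """Binomial log-likelihood in theta, up to an additive constant."""
    return y * math.log(theta) + (n - y) * math.log(1.0 - theta)

def log_lik_phi(y, n, phi):
    """The same log-likelihood, parameterized by phi = logit(theta)."""
    return log_lik_theta(y, n, expit(phi))

def fisher_information(log_lik, value, n, theta, eps=1e-5):
    """E[(d log p / d param)^2] at param = value, with y ~ Bin(n, theta)."""
    total = 0.0
    for y in range(n + 1):
        score = (log_lik(y, n, value + eps) - log_lik(y, n, value - eps)) / (2.0 * eps)
        total += binom_pmf(y, n, theta) * score ** 2
    return total

n, theta = 10, 0.3
phi = math.log(theta / (1.0 - theta))
J_theta = fisher_information(log_lik_theta, theta, n, theta)
J_phi = fisher_information(log_lik_phi, phi, n, theta)
dtheta_dphi = theta * (1.0 - theta)  # derivative of expit at phi

# Both pairs should agree up to finite-difference error.
print(J_phi, J_theta * dtheta_dphi ** 2)
print(math.sqrt(J_phi), math.sqrt(J_theta) * dtheta_dphi)
```

Agreement of the last pair is the statement $J(\phi)^{1/2} = J(\theta)^{1/2} |d\theta/d\phi|$ for this particular model and transformation; it is a spot check, not a proof.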


Various noninformative prior distributions for the binomial parameter 


Consider the binomial distribution: $y \sim \text{Bin}(n, \theta)$, which has log-likelihood

$$\log p(y|\theta) = \text{constant} + y \log \theta + (n - y) \log(1 - \theta).$$

Routine evaluation of the second derivative and substitution of $\mathrm{E}(y|\theta) = n\theta$ yields the
Fisher information:

$$J(\theta) = -\mathrm{E}\left[ \frac{d^2 \log p(y|\theta)}{d\theta^2} \,\Big|\, \theta \right] = \frac{n}{\theta(1 - \theta)}.$$

Jeffreys’ prior density is then $p(\theta) \propto \theta^{-1/2}(1 - \theta)^{-1/2}$, which is a Beta(1/2, 1/2) density. By
comparison, recall the Bayes-Laplace uniform prior density, which can be expressed as
$\theta \sim$ Beta(1, 1). On the other hand, the prior density that is uniform in the natural parameter
of the exponential family representation of the distribution is $p(\text{logit}(\theta)) \propto$ constant (see
Exercise 2.7), which corresponds to the improper Beta(0, 0) density on $\theta$. In practice,
the difference between these alternatives is often small, since to get from $\theta \sim$ Beta(0, 0)
to $\theta \sim$ Beta(1, 1) is equivalent to passing from prior to posterior distribution given one
more success and one more failure, and usually 2 is a small fraction of the total number of
observations. But one must be careful with the improper Beta(0, 0) prior distribution: if
$y = 0$ or $n$, the resulting posterior distribution is improper!
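
To see how little these three priors differ in practice, and where the Beta(0, 0) prior breaks down, one can tabulate the conjugate posteriors. The following is a small sketch under the standard beta-binomial update, Beta($a + y$, $b + n - y$); the counts and the helper name `beta_binomial_posterior` are hypothetical.

```python
def beta_binomial_posterior(a, b, y, n):
    """Posterior Beta parameters for a Beta(a, b) prior and y successes in n trials.

    Raises ValueError when the result is not a proper Beta distribution.
    """
    a_post, b_post = a + y, b + n - y
    if a_post <= 0 or b_post <= 0:
        raise ValueError("improper posterior: both Beta parameters must be positive")
    return a_post, b_post

priors = {
    "Beta(0, 0), uniform on logit scale": (0.0, 0.0),
    "Beta(1/2, 1/2), Jeffreys": (0.5, 0.5),
    "Beta(1, 1), uniform": (1.0, 1.0),
}

y, n = 7, 50  # hypothetical data
for name, (a, b) in priors.items():
    a_post, b_post = beta_binomial_posterior(a, b, y, n)
    print(name, "-> posterior mean", a_post / (a_post + b_post))

# With y = 0 the Beta(0, 0) prior gives an improper posterior.
try:
    beta_binomial_posterior(0.0, 0.0, 0, n)
except ValueError as err:
    print("y = 0:", err)
```

With these counts the three posterior means are 7/50, 7.5/51, and 8/52, which differ only in the second decimal place.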


Pivotal quantities 


For the binomial and other single-parameter models, different principles give (slightly) dif- 
ferent noninformative prior distributions. But for two cases—location parameters and scale 
parameters—all principles seem to agree. 


1. If the density of $y$ is such that $p(y - \theta | \theta)$ is a function that is free of $\theta$ and $y$, say,
$f(u)$, where $u = y - \theta$, then $y - \theta$ is a pivotal quantity, and $\theta$ is called a pure location
parameter. In such a case, it is reasonable that a noninformative prior distribution for $\theta$
would give $f(y - \theta)$ for the posterior distribution, $p(y - \theta | y)$. That is, under the posterior
distribution, $y - \theta$ should still be a pivotal quantity, whose distribution is free of both
$\theta$ and $y$. Under this condition, using Bayes’ rule, $p(y - \theta | y) \propto p(\theta)\, p(y - \theta | \theta)$, thereby
implying that the noninformative prior density is uniform on $\theta$; that is, $p(\theta) \propto$ constant
over the range $(-\infty, \infty)$.

2. If the density of $y$ is such that $p(y/\theta | \theta)$ is a function that is free of $\theta$ and $y$, say, $g(u)$, where
$u = y/\theta$, then $u = y/\theta$ is a pivotal quantity and $\theta$ is called a pure scale parameter. In such
a case, it is reasonable that a noninformative prior distribution for $\theta$ would give $g(y/\theta)$
for the posterior distribution, $p(y/\theta | y)$. By transformation of variables, the conditional
distribution of $y$ given $\theta$ can be expressed in terms of the distribution of $u$ given $\theta$,

$$p(y | \theta) = \frac{1}{\theta}\, p(u | \theta),$$

and similarly,

$$p(\theta | y) = \frac{y}{\theta^2}\, p(u | y).$$

After letting both $p(u | \theta)$ and $p(u | y)$ equal $g(u)$, we have the identity $p(\theta | y) = \frac{y}{\theta}\, p(y | \theta)$.
Thus, in this case, the reference prior distribution is $p(\theta) \propto 1/\theta$ or, equivalently, $p(\log \theta) \propto 1$
or $p(\theta^2) \propto 1/\theta^2$.

This approach, in which the sampling distribution of the pivot is used as its posterior 
distribution, can be applied to sufficient statistics in more complicated examples, such as 
hierarchical normal models. 

Even these principles can be misleading in some problems, in the critical sense of suggest- 
ing prior distributions that can lead to improper posterior distributions. For example, the 
uniform prior density does not work for the logarithm of a hierarchical variance parameter, 
as we discuss in Section 5.4. 


Difficulties with noninformative prior distributions 


The search for noninformative priors has several problems, including: 


1. Searching for a prior distribution that is always vague seems misguided: if the likelihood 
is truly dominant in a given problem, then the choice among a range of relatively flat 
prior densities cannot matter. Establishing a particular specification as the reference 
prior distribution seems to encourage its automatic, and possibly inappropriate, use. 


2. For many problems, there is no clear choice for a vague prior distribution, since a density
that is flat or uniform in one parameterization will not be in another. This is the
essential difficulty with Laplace’s principle of insufficient reason: on what scale should
the principle apply? For example, the ‘reasonable’ prior density on the normal mean $\theta$
above is uniform, while for $\sigma^2$, the density $p(\sigma^2) \propto 1/\sigma^2$ seems reasonable. However, if
we define $\phi = \log \sigma^2$, then the prior density on $\phi$ is

$$p(\phi) = p(\sigma^2) \left| \frac{d\sigma^2}{d\phi} \right| \propto \frac{1}{\sigma^2}\, \sigma^2 = 1;$$

that is, uniform on $\phi = \log \sigma^2$. With discrete distributions, there is the analogous
difficulty of deciding how to subdivide outcomes into ‘atoms’ of equal probability.


3. Further difficulties arise when averaging over a set of competing models that have im- 
proper prior distributions, as we discuss in Section 7.3. 
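
The change of scale described in item 2 can be illustrated numerically. The sketch below is an illustration under simple assumptions, with a hypothetical helper name: it integrates the unnormalized density $1/\sigma^2$ over several intervals of $\sigma^2$ that each have unit length on the $\log \sigma^2$ scale. If the prior is uniform in $\log \sigma^2$, each interval should receive the same unnormalized mass.

```python
import math

def mass_under_inverse(a, b, steps=100000):
    """Midpoint-rule integral of 1/s over [a, b]: unnormalized prior mass for sigma^2."""
    h = (b - a) / steps
    return sum(h / (a + (i + 0.5) * h) for i in range(steps))

# Intervals of unit length on the phi = log(sigma^2) scale.
intervals_in_phi = [(-2.0, -1.0), (0.0, 1.0), (3.0, 4.0)]
masses = []
for lo, hi in intervals_in_phi:
    mass = mass_under_inverse(math.exp(lo), math.exp(hi))
    masses.append(mass)
    print((lo, hi), mass)
```

Each mass is approximately 1 even though the intervals have very different widths on the $\sigma^2$ scale, which is the sense in which "flat" depends on the parameterization.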


Nevertheless, noninformative and reference prior densities are often useful when it does 
not seem to be worth the effort to quantify one’s real prior knowledge as a probability 
distribution, as long as one is willing to perform the mathematical work to check that 
the posterior density is proper and to determine the sensitivity of posterior inferences to 
modeling assumptions of convenience. 
2.9 Weakly informative prior distributions


We characterize a prior distribution as weakly informative if it is proper but is set up so that 
the information it does provide is intentionally weaker than whatever actual prior knowledge 
is available. We will discuss this further in the context of a specific example, but in general 
any problem has some natural constraints that would allow a weakly informative model. 
For example, for regression models on the logarithmic or logistic scale, with predictors that
are binary or scaled to have standard deviation 1, we can be sure for most applications that
effect sizes will be less than 10, given that a difference of 10 on the log scale changes the
expected value by a factor of $\exp(10) \approx 22{,}000$, and on the logit scale shifts a probability
of $\text{logit}^{-1}(-5) \approx 0.01$ to $\text{logit}^{-1}(5) \approx 0.99$.
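
The magnitudes quoted above are easy to verify. A minimal sketch (the function name `inv_logit` is ours):

```python
import math

def inv_logit(x):
    """Inverse logit (logistic) function."""
    return 1.0 / (1.0 + math.exp(-x))

factor = math.exp(10)          # multiplicative change for a difference of 10 on the log scale
p_low = inv_logit(-5)          # probability at -5 on the logit scale
p_high = inv_logit(5)          # probability at +5 on the logit scale

print(round(factor), round(p_low, 4), round(p_high, 4))  # prints: 22026 0.0067 0.9933
```

A shift of 10 on the logit scale therefore moves a probability from under 1% to over 99%, which is why effects larger than this are implausible in most applications.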

Rather than trying to model complete ignorance, we prefer in most problems to use
weakly informative prior distributions that include a small amount of real-world information,
enough to ensure that the posterior distribution makes sense. For example, in the sex
ratio example from Sections 2.1 and 2.4, one could use a prior distribution concentrated
between 0.4 and 0.6, for example N(0.5, 0.1²) or, to keep the mathematical convenience of
conjugacy, Beta(20, 20).² In the general problem of estimating a normal mean from Section
2.5, a N(0, A²) prior distribution is weakly informative, with A set to some large value that
depends on the context of the problem.
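
How concentrated the suggested sex-ratio priors are on the interval (0.4, 0.6) can be checked directly. The sketch below uses only the Python standard library, integrating the Beta(20, 20) density with Simpson's rule and using the error function for the normal prior with standard deviation 0.1. The helper names are hypothetical, and the numerical integration is a stand-in for a library beta CDF.

```python
import math

def beta_pdf(x, a, b):
    """Beta(a, b) density, computed on the log scale for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def simpson(f, lo, hi, steps=2000):
    """Composite Simpson's rule on [lo, hi]; steps must be even."""
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

beta_mass = simpson(lambda x: beta_pdf(x, 20, 20), 0.4, 0.6)
normal_mass = normal_cdf(0.6, 0.5, 0.1) - normal_cdf(0.4, 0.5, 0.1)

print("Beta(20, 20) mass in (0.4, 0.6):", beta_mass)
print("N(0.5, 0.1^2) mass in (0.4, 0.6):", normal_mass)
```

The beta prior places roughly 80% of its mass in the interval, while the normal prior, whose standard deviation is 0.1, places about 68% there; both are concentrated near 0.5 without excluding values outside the range.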

In almost every real problem, the data analyst will have more information than can 
be conveniently included in the statistical model. This is an issue with the likelihood as 
well as the prior distribution. In practice, there is always compromise for a number of 
reasons: to describe the model more conveniently; because it may be difficult to express 
knowledge accurately in probabilistic form; to simplify computations; or perhaps to avoid 
using a possibly unreliable source of information. Except for the last reason, these are all 
arguments for convenience and are best justified by the claim that the answer would not 
have changed much had we been more accurate. If so few data are available that the choice 
of noninformative prior distribution makes a difference, one should put relevant information 
into the prior distribution, perhaps using a hierarchical model, as we discuss in Chapter 5. 
We return to the issue of accuracy vs. convenience in likelihoods and prior distributions in 
the examples of the later chapters. 


Constructing a weakly informative prior distribution 


One might argue that virtually all statistical models are weakly informative: a model always 
conveys some information, if only in its choice of inputs and the functional form of how 
they are combined, but it is not possible or perhaps even desirable to encode all of one’s 
prior beliefs about a subject into a set of probability distributions. With that in mind, we 
offer two principles for setting up weakly informative priors, going at the problem from two 
different directions: 


• Start with some version of a noninformative prior distribution and then add enough
information so that inferences are constrained to be reasonable.

• Start with a strong, highly informative prior and broaden it to account for uncertainty
in one’s prior beliefs and in the applicability of any historically based prior distribution
to new data.


Neither of these approaches is pure. In the first case, it can happen that the purportedly
noninformative prior distribution used as a starting point is in fact too strong. For example,
if a U(0, 1) prior distribution is assigned to the probability of some rare disease, then in
the presence of weak data the probability can be grossly overestimated (suppose $y = 0$
incidences out of $n = 100$ cases, and the true prevalence is known to be less than 1 in
10,000). An appropriate weakly informative prior will be such that the posterior in this
case is concentrated in that low range. In the second case, a prior distribution that is
believed to be strongly informative may in fact be too weak along some direction. This is
not to say that priors should be made more precise whenever posterior inferences are vague;
in many cases, our best strategy is simply to acknowledge whatever posterior uncertainty
we have. But we should not feel constrained by default noninformative models when we
have substantive prior knowledge available.

² A quick R calculation, pbeta(.6,20,20) - pbeta(.4,20,20), reveals that about 80% of the probability mass
in the Beta(20, 20) falls between 0.4 and 0.6.
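
The rare-disease example above can be made concrete. With a U(0, 1) prior, which is Beta(1, 1), and $y = 0$ cases out of $n$, the posterior is Beta(1, $n + 1$), whose mean is $1/(n + 2)$ and whose distribution function is $1 - (1 - t)^{n+1}$. The following sketch uses these closed forms; the function name is hypothetical and the formulas apply only to the $y = 0$ case.

```python
def uniform_prior_zero_cases(n, threshold):
    """Posterior summaries for theta under a U(0, 1) prior when y = 0 out of n.

    Returns the posterior mean and Pr(theta < threshold | y = 0).
    """
    posterior_mean = 1.0 / (n + 2)                    # mean of Beta(1, n + 1)
    prob_below = 1.0 - (1.0 - threshold) ** (n + 1)   # Beta(1, n + 1) CDF at threshold
    return posterior_mean, prob_below

posterior_mean, prob_below = uniform_prior_zero_cases(n=100, threshold=1e-4)
print("posterior mean:", posterior_mean)
print("Pr(theta < 1/10,000 | y):", prob_below)
```

The posterior mean is close to 0.01, about 100 times the stated upper bound on the prevalence, and only about 1% of the posterior mass falls below 1 in 10,000. In this setting the uniform prior is far from innocuous.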

There are settings, however, when it can be recommended to not use relevant informa- 
tion, even when it could clearly improve posterior inferences. The concern here is often 
expressed in terms of fairness and encoded mathematically as a symmetry principle, that 
the prior distribution should not pull inferences in any predetermined direction. For exam- 
ple, consider an experimenter studying an effect that she is fairly sure is positive; perhaps 
her prior distribution is N(0.5,0.5) on some appropriate scale. Such an assumption might 
be perfectly reasonable given current scientific information but seems potentially risky if it
is part of the analysis of an experiment designed to test the scientist’s theory. If anything, 
one might want a prior distribution that leans against an experimenter’s hypothesis in order 
to require a higher standard of proof. 

Ultimately, such concerns can and should be subsumed into decision analysis and some 
sort of model of the entire scientific process, trading off the gains of early identification of 
large and real effects against the losses entailed in overestimating the magnitudes of effects 
and overreacting to patterns that could be attributed to chance. In the meantime, though, 
we know that statistical inferences are taken as evidence of effects, and as guides to future 
decision making, and for this purpose it can make sense to require models to have certain 
constraints such as symmetry about 0 for the prior distribution of a single treatment effect. 


2.10 Bibliographic note 


A fascinating detailed account of the early development of the idea of ‘inverse probability’ 
(Bayesian inference) is provided in the book by Stigler (1986), on which our brief accounts 
of Bayes’ and Laplace’s solutions to the problem of estimating an unknown proportion are 
based. Bayes’ famous 1763 essay in the Philosophical Transactions of the Royal Society of 
London has been reprinted as Bayes (1763); see also Laplace (1785, 1810). 

Introductory textbooks providing complementary discussions of the simple models cov- 
ered in this chapter were listed at the end of Chapter 1. In particular, Box and Tiao (1973) 
provide a detailed treatment of Bayesian analysis with the normal model and also discuss 
highest posterior density regions in some detail. The theory of conjugate prior distributions 
was developed in detail by Raiffa and Schlaifer (1961). An interesting account of inference 
for prediction, which also includes extensive details of particular probability models and 
conjugate prior analyses, appears in Aitchison and Dunsmore (1975). 

Liu et al. (2013) discuss how to efficiently compute highest posterior density intervals 
using simulations. 

Noninformative and reference prior distributions have been studied by many researchers. 
Jeffreys (1961) and Hartigan (1964) discuss invariance principles for noninformative prior 
distributions. Chapter 1 of Box and Tiao (1973) presents a straightforward and practically 
oriented discussion, a brief but detailed survey is given by Berger (1985), and the article by 
Bernardo (1979) is accompanied by a wide-ranging discussion. Bernardo and Smith (1994) 
give an extensive treatment of this topic along with many other matters relevant to the 
construction of prior distributions. Barnard (1985) discusses the relation between pivotal 
quantities and noninformative Bayesian inference. Kass and Wasserman (1996) provide a 
review of many approaches for establishing noninformative prior densities based on Jeffreys’ 
rule, and they also discuss the problems that may arise from uncritical use of purportedly
noninformative prior specifications. Dawid, Stone, and Zidek (1973) discuss some difficulties 
that can arise with noninformative prior distributions; also see Jaynes (1980). 

Kerman (2011) discusses noninformative and informative conjugate prior distributions 
for the binomial and Poisson models. 

Jaynes (1983) discusses in several places the idea of objectively constructing prior dis- 
tributions based on invariance principles and maximum entropy. Appendix A of Bretthorst 
(1988) outlines an objective Bayesian approach to assigning prior distributions, as applied 
to the problem of estimating the parameters of a sinusoid from time series data. More 
discussions of maximum entropy models appear in Jaynes (1982), Skilling (1989), and Gull 
(1989a); see Titterington (1984) and Donoho et al. (1992) for other views. 

For more on weakly informative prior distributions, see Gelman (2006a) and Gelman, 
Jakulin, et al. (2008). Gelman (2004b) discusses connections between parameterization and 
Bayesian modeling. Greenland (2001) discusses informative prior distributions in epidemi- 
ology. 

The data for the placenta previa example come from a study from 1922 reported in 
James (1987). For more on the challenges of estimating sex ratios from small samples, 
see Gelman and Weakliem (2009). The Bayesian analysis of age-adjusted kidney cancer 
death rates in Section 2.7 is adapted from Manton et al. (1989); see also Gelman and Nolan 
(2002a) for more on this particular example and Bernardinelli, Clayton, and Montomoli 
(1995) for a general discussion of prior distributions for disease mapping. Gelman and 
Price (1999) discuss artifacts in maps of parameter estimates, and Louis (1984), Shen and 
Louis (1998), and Louis and Shen (1999) analyze the general problem of estimation of 
ensembles of parameters, a topic to which we return in Chapter 5. 


2.11 Exercises 


1. Posterior inference: suppose you have a Beta(4, 4) prior distribution on the probability $\theta$
that a coin will yield a ‘head’ when spun in a specified manner. The coin is independently
spun ten times, and ‘heads’ appear fewer than 3 times. You are not told how many heads
were seen, only that the number is less than 3. Calculate your exact posterior density
(up to a proportionality constant) for $\theta$ and sketch it.


2. Predictive distributions: consider two coins, $C_1$ and $C_2$, with the following characteristics:
$\Pr(\text{heads} | C_1) = 0.6$ and $\Pr(\text{heads} | C_2) = 0.4$. Choose one of the coins at random and
imagine spinning it repeatedly. Given that the first two spins from the chosen coin are
tails, what is the expectation of the number of additional spins until a head shows up?


3. Predictive distributions: let y be the number of 6’s in 1000 rolls of a fair die. 
(a) Sketch the approximate distribution of y, based on the normal approximation. 


(b) Using the normal distribution table, give approximate 5%, 25%, 50%, 75%, and 95% 
points for the distribution of y. 
4. Predictive distributions: let $y$ be the number of 6’s in 1000 independent rolls of a
particular real die, which may be unfair. Let $\theta$ be the probability that the die lands on ‘6.’
Suppose your prior distribution for $\theta$ is as follows:

$$\Pr(\theta = 1/12) = 0.25, \qquad \Pr(\theta = 1/6) = 0.5, \qquad \Pr(\theta = 1/4) = 0.25.$$

(a) Using the normal approximation for the conditional distributions, $p(y | \theta)$, sketch your
approximate prior predictive distribution for $y$.

(b) Give approximate 5%, 25%, 50%, 75%, and 95% points for the distribution of $y$. (Be
careful here: $y$ does not have a normal distribution, but you can still use the normal
distribution as part of your analysis.)


5. Posterior distribution as a compromise between prior information and data: let $y$ be the
number of heads in $n$ spins of a coin, whose probability of heads is $\theta$.

(a) If your prior distribution for $\theta$ is uniform on the range [0, 1], derive your prior predictive
distribution for $y$,

$$\Pr(y = k) = \int_0^1 \Pr(y = k | \theta)\, d\theta,$$

for each $k = 0, 1, \ldots, n$.

(b) Suppose you assign a Beta($\alpha$, $\beta$) prior distribution for $\theta$, and then you observe $y$ heads
out of $n$ spins. Show algebraically that your posterior mean of $\theta$ always lies between
your prior mean, $\frac{\alpha}{\alpha + \beta}$, and the observed relative frequency of heads, $\frac{y}{n}$.

(c) Show that, if the prior distribution on $\theta$ is uniform, the posterior variance of $\theta$ is
always less than the prior variance.

(d) Give an example of a Beta($\alpha$, $\beta$) prior distribution and data $y$, $n$, in which the posterior
variance of $\theta$ is higher than the prior variance.


6. Predictive distributions: Derive the mean and variance (2.17) of the negative binomial 
predictive distribution for the cancer rate example, using the mean and variance formulas 
(1.8) and (1.9). 

7. Noninformative prior densities: 

(a) For the binomial likelihood, $y \sim \text{Bin}(n, \theta)$, show that $p(\theta) \propto \theta^{-1}(1 - \theta)^{-1}$ is the
uniform prior distribution for the natural parameter of the exponential family.


(b) Show that if y = 0 or n, the resulting posterior distribution is improper. 


8. Normal distribution with unknown mean: a random sample of $n$ students is drawn
from a large population, and their weights are measured. The average weight of the $n$
sampled students is $\bar{y} = 150$ pounds. Assume the weights in the population are normally
distributed with unknown mean $\theta$ and known standard deviation 20 pounds. Suppose
your prior distribution for $\theta$ is normal with mean 180 and standard deviation 40.

(a) Give your posterior distribution for $\theta$. (Your answer will be a function of $n$.)

(b) A new student is sampled at random from the same population and has a weight of
$\tilde{y}$ pounds. Give a posterior predictive distribution for $\tilde{y}$. (Your answer will still be a
function of $n$.)

(c) For $n = 10$, give a 95% posterior interval for $\theta$ and a 95% posterior predictive interval
for $\tilde{y}$.

(d) Do the same for $n = 100$.


9. Setting parameters for a beta prior distribution: suppose your prior distribution for $\theta$,
the proportion of Californians who support the death penalty, is beta with mean 0.6 and
standard deviation 0.3.

(a) Determine the parameters $\alpha$ and $\beta$ of your prior distribution. Sketch the prior density
function.

(b) A random sample of 1000 Californians is taken, and 65% support the death penalty.
What are your posterior mean and variance for $\theta$? Draw the posterior density function.

(c) Examine the sensitivity of the posterior distribution to different prior means and
widths including a non-informative prior.
Year    Fatal accidents    Passenger deaths    Death rate
1976    24                 734                 0.19
1977    25                 516                 0.12
1978    31                 754                 0.15
1979    31                 877                 0.16
1980    22                 814                 0.14
1981    21                 362                 0.06
1982    26                 764                 0.13
1983    20                 809                 0.13
1984    16                 223                 0.03
1985    22                 1066                0.15

Table 2.2 Worldwide airline fatalities, 1976–1985. Death rate is passenger deaths per 100 million
passenger miles. Source: Statistical Abstract of the United States.


10. Discrete sample spaces: suppose there are N cable cars in San Francisco, numbered 
sequentially from 1 to N. You see a cable car at random; it is numbered 203. You wish 
to estimate N. (See Goodman, 1952, for a discussion and references to several versions of 
this problem, and Jeffreys, 1961, Lee, 1989, and Jaynes, 2003, for Bayesian treatments.) 


(a) Assume your prior distribution on $N$ is geometric with mean 100; that is,

$$p(N) = (1/100)(99/100)^{N-1}, \quad \text{for } N = 1, 2, \ldots.$$


What is your posterior distribution for N? 


(b) What are the posterior mean and standard deviation of N? (Sum the infinite series 
analytically or approximate them on the computer.) 


(c) Choose a reasonable ‘noninformative’ prior distribution for N and give the resulting 
posterior distribution, mean, and standard deviation for N. 


11. Computing with a nonconjugate single-parameter model: suppose $y_1, \ldots, y_5$ are
independent samples from a Cauchy distribution with unknown center $\theta$ and known scale 1:
$p(y_i | \theta) \propto 1/(1 + (y_i - \theta)^2)$. Assume, for simplicity, that the prior distribution for $\theta$ is
uniform on [0, 100]. Given the observations $(y_1, \ldots, y_5) = (43, 44, 45, 46.5, 47.5)$:

(a) Compute the unnormalized posterior density function, $p(\theta)\, p(y | \theta)$, on a grid of points
$\theta = 0, \frac{1}{m}, \frac{2}{m}, \ldots, 100$, for some large integer $m$. Using the grid approximation, compute
and plot the normalized posterior density function, $p(\theta | y)$, as a function of $\theta$.

(b) Sample 1000 draws of $\theta$ from the posterior density and plot a histogram of the draws.

(c) Use the 1000 samples of $\theta$ to obtain 1000 samples from the predictive distribution of
a future observation, $\tilde{y}$, and plot a histogram of the predictive draws.


12. Jeffreys’ prior distributions: suppose $y | \theta \sim \text{Poisson}(\theta)$. Find Jeffreys’ prior density for $\theta$,
and then find $\alpha$ and $\beta$ for which the Gamma($\alpha$, $\beta$) density is a close match to Jeffreys’
density.

13. Discrete data: Table 2.2 gives the number of fatal accidents and deaths on scheduled 
airline flights per year over a ten-year period. We use these data as a numerical example 
for fitting discrete data models. 


(a) Assume that the numbers of fatal accidents in each year are independent with a
Poisson($\theta$) distribution. Set a prior distribution for $\theta$ and determine the posterior
distribution based on the data from 1976 through 1985. Under this model, give a 95%
predictive interval for the number of fatal accidents in 1986. You can use the normal
approximation to the gamma and Poisson or compute using simulation.

(b) Assume that the numbers of fatal accidents in each year follow independent Poisson
distributions with a constant rate and an exposure in each year proportional to the
number of passenger miles flown. Set a prior distribution for $\theta$ and determine the
posterior distribution based on the data for 1976–1985. (Estimate the number of
passenger miles flown in each year by dividing the appropriate columns of Table 2.2
and ignoring round-off errors.) Give a 95% predictive interval for the number of fatal
accidents in 1986 under the assumption that $8 \times 10^{11}$ passenger miles are flown that
year.

(c) Repeat (a) above, replacing ‘fatal accidents’ with ‘passenger deaths.’

(d) Repeat (b) above, replacing ‘fatal accidents’ with ‘passenger deaths.’

(e) In which of the cases (a)–(d) above does the Poisson model seem more or less
reasonable? Why? Discuss based on general principles, without specific reference to the
numbers in Table 2.2.

Incidentally, in 1986, there were 22 fatal accidents, 546 passenger deaths, and a death
rate of 0.06 per 100 million miles flown. We return to this example in Exercises 3.12,
6.2, 6.3, and 8.14.

14. Algebra of the normal model:

(a) Fill in the steps to derive (2.9)–(2.10), and (2.11)–(2.12).

(b) Derive (2.11) and (2.12) by starting with a N($\mu_0$, $\tau_0^2$) prior distribution and adding
data points one at a time, using the posterior distribution at each step as the prior
distribution for the next.


15. Beta distribution: assume the result, from standard advanced calculus, that

$$\int_0^1 u^{\alpha - 1}(1 - u)^{\beta - 1}\, du = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}.$$

If $Z$ has a beta distribution with parameters $\alpha$ and $\beta$, find $\mathrm{E}[Z^m(1 - Z)^n]$ for any
nonnegative integers $m$ and $n$. Hence derive the mean and variance of $Z$.


16. Beta-binomial distribution and Bayes’ prior distribution: suppose $y$ has a binomial
distribution for given $n$ and unknown parameter $\theta$, where the prior distribution of $\theta$ is
Beta($\alpha$, $\beta$).

(a) Find $p(y)$, the marginal distribution of $y$, for $y = 0, \ldots, n$ (unconditional on $\theta$). This
discrete distribution is known as the beta-binomial, for obvious reasons.

(b) Show that if the beta-binomial probability is constant in $y$, then the prior distribution
has to have $\alpha = \beta = 1$.


17. Posterior intervals: unlike the central posterior interval, the highest posterior interval
is not invariant to transformation. For example, suppose that, given $\sigma^2$, the quantity
$nv/\sigma^2$ is distributed as $\chi^2_n$, and that $\sigma$ has the (improper) noninformative prior density
$p(\sigma) \propto \sigma^{-1}$, $\sigma > 0$.

(a) Prove that the corresponding prior density for $\sigma^2$ is $p(\sigma^2) \propto \sigma^{-2}$.

(b) Show that the 95% highest posterior density region for $\sigma^2$ is not the same as the region
obtained by squaring the endpoints of a posterior interval for $\sigma$.

18. Poisson model: derive the gamma posterior distribution (2.15) for the Poisson model 
parameterized in terms of rate and exposure with conjugate prior distribution. 

19. Exponential model with conjugate prior distribution:

(a) Show that if $y | \theta$ is exponentially distributed with rate $\theta$, then the gamma prior
distribution is conjugate for inferences about $\theta$ given an independent and identically
distributed sample of $y$ values.

(b) Show that the equivalent prior specification for the mean, $\phi = 1/\theta$, is inverse-gamma.
(That is, derive the latter density function.)

(c) The length of life of a light bulb manufactured by a certain process has an exponential
distribution with unknown rate $\theta$. Suppose the prior distribution for $\theta$ is a gamma
distribution with coefficient of variation 0.5. (The coefficient of variation is defined
as the standard deviation divided by the mean.) A random sample of light bulbs is
to be tested and the lifetime of each obtained. If the coefficient of variation of the
distribution of $\theta$ is to be reduced to 0.1, how many light bulbs need to be tested?

(d) In part (c), if the coefficient of variation refers to $\phi$ instead of $\theta$, how would your
answer be changed?


20. Censored and uncensored data in the exponential model:

(a) Suppose $y | \theta$ is exponentially distributed with rate $\theta$, and the marginal (prior)
distribution of $\theta$ is Gamma($\alpha$, $\beta$). Suppose we observe that $y > 100$, but do not observe
the exact value of $y$. What is the posterior distribution, $p(\theta | y > 100)$, as a function of
$\alpha$ and $\beta$? Write down the posterior mean and variance of $\theta$.

(b) In the above problem, suppose that we are now told that $y$ is exactly 100. Now what
are the posterior mean and variance of $\theta$?

(c) Explain why the posterior variance of $\theta$ is higher in part (b) even though more
information has been observed. Why does this not contradict identity (2.8) on page
32?

21. Simple hierarchical modeling:
The file pew_research_center_june_elect_wknd_data.dta³ has data from Pew Research
Center polls taken during the 2008 election campaign. You can read these data into R
using the read.dta() function (after first loading the foreign package into R).
Your task is to estimate the percentage of the (adult) population in each state (excluding
Alaska, Hawaii, and the District of Columbia) who label themselves as ‘very liberal,’
following the general procedure that was used in Section 2.7 to estimate cancer rates,
but using the binomial and beta rather than Poisson and gamma distributions. But you
do not need to make maps; it will be enough to make scatterplots, plotting the estimate
vs. Barack Obama’s vote share in 2008 (data available at 2008ElectionResult.csv,
readable in R using read.csv()).
Make the following four graphs on a single page:

• Graph proportion very liberal among the survey respondents in each state vs. Obama
vote share, that is, a scatterplot using the two-letter state abbreviations (see state.abb
in R).

• Graph the Bayes posterior mean in each state vs. Obama vote share.

• Repeat the two graphs above using the number of respondents in the state on the x-axis.

This exercise has four challenges: first, manipulating the data in order to get the totals
by state; second, estimating the parameters of the prior distribution; third, doing the
Bayesian analysis by state; and fourth, making the graphs.


22. Prior distributions:
A (hypothetical) study is performed to estimate the effect of a simple training program
on basketball free-throw shooting. A random sample of 100 college students is recruited
into the study. Each student first shoots 100 free-throws to establish a baseline success
probability. Each student then takes 50 practice shots each day for a month. At the end
of that time, he or she takes 100 shots for a final measurement. Let $\theta$ be the average
improvement in success probability.
Give three prior distributions for $\theta$ (explaining each in a sentence):

(a) A noninformative prior,
(b) A subjective prior based on your best knowledge, and
(c) A weakly informative prior.

³ For data for this and other exercises, go to http://www.stat.columbia.edu/~gelman/book/.
Chapter 3


Introduction to multiparameter models 


Virtually every practical problem in statistics involves more than one unknown or
unobservable quantity. It is in dealing with such problems that the simple conceptual framework
of the Bayesian approach reveals its principal advantages over other methods of inference.
Although a problem can include several parameters of interest, conclusions will often be
drawn about one, or only a few, parameters at a time. In this case, the ultimate aim of a
Bayesian analysis is to obtain the marginal posterior distribution of the particular
parameters of interest. In principle, the route to achieving this aim is clear: we first require the
joint posterior distribution of all unknowns, and then we integrate this distribution over the
unknowns that are not of immediate interest to obtain the desired marginal distribution.
Or equivalently, using simulation, we draw samples from the joint posterior distribution
and then look at the parameters of interest and ignore the values of the other unknowns.
In many problems there is no interest in making inferences about many of the unknown
parameters, although they are required in order to construct a realistic model. Parameters
of this kind are often called nuisance parameters. A classic example is the scale of the
random errors in a measurement problem.

We begin this chapter with a general treatment of nuisance parameters and then cover 
the normal distribution with unknown mean and variance in Section 3.2. Sections 3.4 
and 3.5 present inference for the multinomial and multivariate normal distributions—the 
simplest models for discrete and continuous multivariate data, respectively. The chapter 
concludes with an analysis of a nonconjugate logistic regression model, using numerical 
computation of the posterior density on a grid. 


3.1 Averaging over ‘nuisance parameters’ 


To express the ideas of joint and marginal posterior distributions mathematically, suppose
$\theta$ has two parts, each of which can be a vector, $\theta = (\theta_1, \theta_2)$, and further suppose that we
are only interested (at least for the moment) in inference for $\theta_1$, so $\theta_2$ may be considered a
‘nuisance’ parameter. For instance, in the simple example,

$$y \mid \mu, \sigma^2 \sim N(\mu, \sigma^2),$$

in which both $\mu$ (=‘$\theta_1$’) and $\sigma^2$ (=‘$\theta_2$’) are unknown, interest commonly centers on $\mu$.
We seek the conditional distribution of the parameter of interest given the observed
data; in this case, $p(\theta_1 \mid y)$. This is derived from the joint posterior density,

$$p(\theta_1, \theta_2 \mid y) \propto p(y \mid \theta_1, \theta_2)\, p(\theta_1, \theta_2),$$

by averaging over $\theta_2$:

$$p(\theta_1 \mid y) = \int p(\theta_1, \theta_2 \mid y)\, d\theta_2.$$


Alternatively, the joint posterior density can be factored to yield

$$p(\theta_1 \mid y) = \int p(\theta_1 \mid \theta_2, y)\, p(\theta_2 \mid y)\, d\theta_2, \qquad (3.1)$$


which shows that the posterior distribution of interest, $p(\theta_1 \mid y)$, is a mixture of the
conditional posterior distributions given the nuisance parameter, $\theta_2$, where $p(\theta_2 \mid y)$ is a weighting
function for the different possible values of $\theta_2$. The weights depend on the posterior density
of $\theta_2$ and thus on a combination of evidence from data and prior model. The averaging over
nuisance parameters $\theta_2$ can be interpreted generally; for example, $\theta_2$ can include a discrete
component representing different possible sub-models.

We rarely evaluate the integral (3.1) explicitly, but it suggests an important practical
strategy for both constructing and computing with multiparameter models. Posterior
distributions can be computed by marginal and conditional simulation, first drawing $\theta_2$ from
its marginal posterior distribution and then $\theta_1$ from its conditional posterior distribution,
given the drawn value of $\theta_2$. In this way the integration embodied in (3.1) is performed
indirectly. A canonical example of this form of analysis is provided by the normal model
with unknown mean and variance, to which we now turn.


3.2 Normal data with a noninformative prior distribution 


As the prototype example of estimating the mean of a population from a sample, we consider 
a vector $y$ of $n$ independent observations from a univariate normal distribution, $N(\mu, \sigma^2)$;
the generalization to the multivariate normal distribution appears in Section 3.5. We begin 
by analyzing the model under a noninformative prior distribution, with the understanding 
that this is no more than a convenient assumption for the purposes of exposition and is 
easily extended to informative prior distributions. 


A noninformative prior distribution 


We saw in Chapter 2 that a sensible vague prior density for $\mu$ and $\sigma$, assuming prior
independence of location and scale parameters, is uniform on $(\mu, \log\sigma)$ or, equivalently,

$$p(\mu, \sigma^2) \propto (\sigma^2)^{-1}.$$
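

The joint posterior distribution, $p(\mu, \sigma^2 \mid y)$


Under this conventional improper prior density, the joint posterior distribution is
proportional to the likelihood function multiplied by the factor $1/\sigma^2$:

$$\begin{aligned}
p(\mu, \sigma^2 \mid y) &\propto \sigma^{-n-2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \mu)^2\right) \\
&= \sigma^{-n-2} \exp\left(-\frac{1}{2\sigma^2} \left[(n-1)s^2 + n(\bar{y} - \mu)^2\right]\right), \qquad (3.2)
\end{aligned}$$

where

$$s^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y})^2$$

is the sample variance of the $y_i$’s. The sufficient statistics are $\bar{y}$ and $s^2$.


The conditional posterior distribution, $p(\mu \mid \sigma^2, y)$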


In order to factor the joint posterior density as in (3.1), we consider first the conditional
posterior density, $p(\mu \mid \sigma^2, y)$, and then the marginal posterior density, $p(\sigma^2 \mid y)$. To determine
the posterior distribution of $\mu$, given $\sigma^2$, we simply use the result derived in Section 2.5 for
the mean of a normal distribution with known variance and a uniform prior distribution:

$$\mu \mid \sigma^2, y \sim N(\bar{y}, \sigma^2/n). \qquad (3.3)$$


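The marginal posterior distribution, $p(\sigma^2 \mid y)$


To determine $p(\sigma^2 \mid y)$, we must average the joint distribution (3.2) over $\mu$:

$$p(\sigma^2 \mid y) \propto \int \sigma^{-n-2} \exp\left(-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y}-\mu)^2\right]\right) d\mu.$$

Integrating this expression over $\mu$ requires evaluating the integral of
$\exp\left(-\frac{1}{2\sigma^2} n(\bar{y}-\mu)^2\right)$, which is a simple normal integral; thus,

$$\begin{aligned}
p(\sigma^2 \mid y) &\propto \sigma^{-n-2} \exp\left(-\frac{1}{2\sigma^2}(n-1)s^2\right) \sqrt{2\pi\sigma^2/n} \\
&\propto (\sigma^2)^{-(n+1)/2} \exp\left(-\frac{(n-1)s^2}{2\sigma^2}\right), \qquad (3.4)
\end{aligned}$$

which is a scaled inverse-$\chi^2$ density:

$$\sigma^2 \mid y \sim \text{Inv-}\chi^2(n-1, s^2). \qquad (3.5)$$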


We have thus factored the joint posterior density (3.2) as the product of conditional and
marginal posterior densities: $p(\mu, \sigma^2 \mid y) = p(\mu \mid \sigma^2, y)\, p(\sigma^2 \mid y)$.

This marginal posterior distribution for $\sigma^2$ has a remarkable similarity to the analogous
sampling theory result: conditional on $\sigma^2$ (and $\mu$), the distribution of the appropriately
scaled sufficient statistic, $(n-1)s^2/\sigma^2$, is $\chi^2_{n-1}$. Considering our derivation of the reference prior
distribution for the scale parameter in Section 2.8, however, this result is not surprising.


Sampling from the joint posterior distribution 


It is easy to draw samples from the joint posterior distribution: first draw $\sigma^2$ from (3.5),
then draw $\mu$ from (3.3). We also derive some analytical results for the posterior distribution,
since this is one of the few multiparameter problems simple enough to solve in closed form.
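
The two-step recipe can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not code from the text; the function names and the summary values `ybar=10.0`, `s2=4.0`, `n=20` are hypothetical. It relies on the representation of an Inv-$\chi^2(\nu, s^2)$ draw as $\nu s^2$ divided by a $\chi^2_\nu$ draw, and on the fact that $\chi^2_\nu$ is a gamma distribution with shape $\nu/2$ and scale 2.

```python
import math
import random


def draw_scaled_inv_chi2(nu, s2, rng):
    """One draw from Inv-chi^2(nu, s2): nu * s2 divided by a chi^2_nu draw."""
    chi2 = rng.gammavariate(nu / 2.0, 2.0)  # chi^2_nu = Gamma(shape=nu/2, scale=2)
    return nu * s2 / chi2


def draw_joint_posterior(ybar, s2, n, rng):
    """One draw of (mu, sigma2) under the prior p(mu, sigma2) proportional to 1/sigma2."""
    sigma2 = draw_scaled_inv_chi2(n - 1, s2, rng)   # marginal draw, as in (3.5)
    mu = rng.gauss(ybar, math.sqrt(sigma2 / n))     # conditional draw, as in (3.3)
    return mu, sigma2


# Hypothetical sufficient statistics, used only to exercise the sampler.
rng = random.Random(1)
draws = [draw_joint_posterior(ybar=10.0, s2=4.0, n=20, rng=rng) for _ in range(5000)]
mu_mean = sum(mu for mu, _ in draws) / len(draws)
```

Keeping only the first component of each pair gives draws from the marginal posterior distribution of $\mu$, which is the simulation counterpart of the integral in (3.1).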


Analytic form of the marginal posterior distribution of $\mu$


The population mean, $\mu$, is typically the estimand of interest, and so the objective of the
Bayesian analysis is the marginal posterior distribution of $\mu$, which can be obtained by
integrating $\sigma^2$ out of the joint posterior distribution. The representation (3.1) shows that
the posterior distribution of $\mu$ can be regarded as a mixture of normal distributions, mixed
over the scaled inverse-$\chi^2$ distribution for the variance, $\sigma^2$. We can derive the marginal
posterior density for $\mu$ by integrating the joint posterior density over $\sigma^2$:

$$p(\mu \mid y) = \int_0^\infty p(\mu, \sigma^2 \mid y)\, d\sigma^2.$$

This integral can be evaluated using the substitution

$$z = \frac{A}{2\sigma^2}, \quad \text{where } A = (n-1)s^2 + n(\mu - \bar{y})^2,$$

and recognizing that the result is an unnormalized gamma integral:

$$\begin{aligned}
p(\mu \mid y) &\propto A^{-n/2} \int_0^\infty z^{(n-2)/2} \exp(-z)\, dz \\
&\propto \left[(n-1)s^2 + n(\mu - \bar{y})^2\right]^{-n/2} \\
&\propto \left[1 + \frac{n(\mu - \bar{y})^2}{(n-1)s^2}\right]^{-n/2}.
\end{aligned}$$

This is the $t_{n-1}(\bar{y}, s^2/n)$ density (see Appendix A).
To put it another way, we have shown that, under the noninformative uniform prior
distribution on $(\mu, \log\sigma)$, the posterior distribution of $\mu$ has the form

$$\left.\frac{\mu - \bar{y}}{s/\sqrt{n}} \,\right|\, y \sim t_{n-1},$$

where $t_{n-1}$ denotes the standard $t$ density (location 0, scale 1) with $n-1$ degrees of freedom.
This marginal posterior distribution provides another interesting comparison with sampling
theory. Under the sampling distribution, $p(y \mid \mu, \sigma^2)$, the following relation holds:

$$\left.\frac{\bar{y} - \mu}{s/\sqrt{n}} \,\right|\, \mu, \sigma^2 \sim t_{n-1}.$$

The sampling distribution of the pivotal quantity $(\bar{y} - \mu)/(s/\sqrt{n})$ does not depend on the
nuisance parameter $\sigma^2$, and its posterior distribution does not depend on data. In general,
a pivotal quantity for the estimand is defined as a nontrivial function of the data and the
estimand whose sampling distribution is independent of all parameters and data.


Posterior predictive distribution for a future observation 


The posterior predictive distribution for a future observation, $\tilde{y}$, can be written as a mixture,
$p(\tilde{y} \mid y) = \iint p(\tilde{y} \mid \mu, \sigma^2, y)\, p(\mu, \sigma^2 \mid y)\, d\mu\, d\sigma^2$. The first of the two factors in the integral is just
the normal distribution for the future observation given the values of $(\mu, \sigma^2)$, and does not
depend on $y$ at all. To draw from the posterior predictive distribution, first draw $\mu, \sigma^2$ from
their joint posterior distribution and then simulate $\tilde{y} \sim N(\mu, \sigma^2)$.

In fact, the posterior predictive distribution of $\tilde{y}$ is a $t$ distribution with location $\bar{y}$,
scale $(1 + \frac{1}{n})^{1/2} s$, and $n-1$ degrees of freedom. This analytic form is obtained using the
same techniques as in the derivation of the posterior distribution of $\mu$. Specifically, the
distribution can be obtained by integrating out the parameters $\mu, \sigma^2$ according to their
joint posterior distribution. We can identify the result more easily by noticing that the
factorization $p(\tilde{y} \mid \sigma^2, y) = \int p(\tilde{y} \mid \mu, \sigma^2, y)\, p(\mu \mid \sigma^2, y)\, d\mu$ leads to $p(\tilde{y} \mid \sigma^2, y) = N(\tilde{y} \mid \bar{y}, (1 + \frac{1}{n})\sigma^2)$,
which is the same, up to a changed scale factor, as the distribution of $\mu \mid \sigma^2, y$.
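
The changed scale factor can be read off by writing the future observation as its mean plus independent noise. The following is a sketch of that step; the symbol $\epsilon$ for a standard normal draw, independent of $\mu$ given $\sigma^2$ and $y$, is notation introduced here and does not appear in the text.

```latex
\tilde{y} = \mu + \sigma\epsilon, \quad \epsilon \sim N(0, 1), \quad
\mu \mid \sigma^2, y \sim N(\bar{y}, \sigma^2/n)
\;\Longrightarrow\;
\tilde{y} \mid \sigma^2, y \sim N\!\left(\bar{y},\ \sigma^2 + \tfrac{\sigma^2}{n}\right)
  = N\!\left(\bar{y},\ (1 + \tfrac{1}{n})\sigma^2\right).
% Mixing over sigma^2 | y ~ Inv-chi^2(n-1, s^2), exactly as in the derivation for mu:
\tilde{y} \mid y \sim t_{n-1}\!\left(\bar{y},\ (1 + \tfrac{1}{n})s^2\right).
```

The two variances add because the uncertainty about $\mu$ and the sampling noise in $\tilde{y}$ are conditionally independent given $\sigma^2$.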

Figure 3.1 Histogram of Simon Newcomb’s measurements for estimating the speed of light, from
Stigler (1977). The data are recorded as deviations from 24,800 nanoseconds.

Example. Estimating the speed of light

Simon Newcomb set up an experiment in 1882 to measure the speed of light. Newcomb
measured the amount of time required for light to travel a distance of 7442 meters. A
histogram of Newcomb’s 66 measurements is shown in Figure 3.1. There are two
unusually low measurements and then a cluster of measurements that are approximately
symmetrically distributed. We (inappropriately) apply the normal model, assuming
that all 66 measurements are independent draws from a normal distribution with mean
$\mu$ and variance $\sigma^2$. The main substantive goal is posterior inference for $\mu$. The outlying
measurements do not fit the normal model; we discuss Bayesian methods for measuring
the lack of fit for these data in Section 6.3. The mean of the 66 measurements is
$\bar{y} = 26.2$, and the sample standard deviation is $s = 10.8$. Assuming the noninformative
prior distribution $p(\mu, \sigma^2) \propto (\sigma^2)^{-1}$, a 95% central posterior interval for $\mu$ is obtained
from the $t_{65}$ marginal posterior distribution of $\mu$ as $\bar{y} \pm 1.997 s/\sqrt{66} = [23.6, 28.8]$.
The posterior interval can also be obtained by simulation. Following the factorization
of the posterior distribution given by (3.5) and (3.3), we first draw a random value of
$\sigma^2 \sim \text{Inv-}\chi^2(65, s^2)$ as $65s^2$ divided by a random draw from the $\chi^2_{65}$ distribution (see
Appendix A). Then given this value of $\sigma^2$, we draw $\mu$ from its conditional posterior
distribution, $N(26.2, \sigma^2/66)$. Based on 1000 simulated values of $(\mu, \sigma^2)$, we estimate
the posterior median of $\mu$ to be 26.2 and a 95% central posterior interval for $\mu$ to be
$[23.6, 28.9]$, close to the analytically calculated interval.

Incidentally, based on the currently accepted value of the speed of light, the ‘true
value’ for $\mu$ in Newcomb’s experiment is 33.0, which falls outside our 95% interval.
This reinforces the fact that posterior inferences are only as good as the model and 
the experiment that produced the data. 
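
Both interval calculations in this example can be reproduced from the quoted summaries alone. The sketch below is an illustration under stated assumptions, not the authors' code: it uses the rounded values $\bar{y} = 26.2$ and $s = 10.8$ and the quantile 1.997 as given above, an arbitrary random seed, and approximate empirical quantiles. Because the summaries are rounded, the endpoints can differ from the published ones in the last digit.

```python
import math
import random

n, ybar, s = 66, 26.2, 10.8  # rounded summaries quoted in the example

# Analytic interval: ybar +/- 1.997 * s / sqrt(n), with 1.997 the quoted t_65 quantile.
half_width = 1.997 * s / math.sqrt(n)
analytic = (ybar - half_width, ybar + half_width)

# Simulation: sigma2 ~ Inv-chi^2(65, s^2), then mu | sigma2 ~ N(ybar, sigma2 / n).
rng = random.Random(2013)  # arbitrary seed, chosen only for reproducibility
mus = []
for _ in range(1000):
    chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)   # chi^2_65 draw
    sigma2 = (n - 1) * s ** 2 / chi2
    mus.append(rng.gauss(ybar, math.sqrt(sigma2 / n)))

mus.sort()
simulated = (mus[24], mus[974])  # approximate 2.5% and 97.5% points of 1000 draws
```

With only 1000 draws the simulated endpoints carry Monte Carlo error of roughly a tenth of a unit, which is consistent with the small discrepancy between the two intervals reported above.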


3.3 Normal data with a conjugate prior distribution


A family of conjugate prior distributions


A first step toward a more general model is to assume a conjugate prior distribution for
the two-parameter univariate normal sampling model in place of the noninformative prior
distribution just considered. The form of the likelihood displayed in (3.2) and the
subsequent discussion shows that the conjugate prior density must also have the product form
$p(\sigma^2)p(\mu \mid \sigma^2)$, where the marginal distribution of $\sigma^2$ is scaled inverse-$\chi^2$ and the conditional
distribution of $\mu$ given $\sigma^2$ is normal (so that marginally $\mu$ has a $t$ distribution). A convenient
parameterization is given by the following specification:

$$\begin{aligned}
\mu \mid \sigma^2 &\sim N(\mu_0, \sigma^2/\kappa_0) \\
\sigma^2 &\sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2),
\end{aligned}$$

which corresponds to the joint prior density

$$p(\mu, \sigma^2) \propto \sigma^{-1} (\sigma^2)^{-(\nu_0/2+1)} \exp\left(-\frac{1}{2\sigma^2}\left[\nu_0\sigma_0^2 + \kappa_0(\mu_0 - \mu)^2\right]\right). \qquad (3.6)$$


We label this the N-Inv-$\chi^2(\mu, \sigma^2 \mid \mu_0, \sigma_0^2/\kappa_0; \nu_0, \sigma_0^2)$ density; its four parameters can be
identified as the location and scale of $\mu$ and the degrees of freedom and scale of $\sigma^2$, respectively.

The appearance of $\sigma^2$ in the conditional distribution of $\mu \mid \sigma^2$ means that $\mu$ and $\sigma^2$ are
necessarily dependent in their joint conjugate prior density: for example, if $\sigma^2$ is large, then
a high-variance prior distribution is induced on $\mu$. This dependence is notable, considering
that conjugate prior distributions are used largely for convenience. Upon reflection, however,
it often makes sense for the prior variance of the mean to be tied to $\sigma^2$, which is the sampling
variance of the observation $y$. In this way, prior belief about $\mu$ is calibrated by the scale of
measurement of $y$ and is equivalent to $\kappa_0$ prior measurements on this scale.


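The joint posterior distribution, $p(\mu, \sigma^2 \mid y)$


Multiplying the prior density (3.6) by the normal likelihood yields the posterior density

$$\begin{aligned}
p(\mu, \sigma^2 \mid y) &\propto \sigma^{-1}(\sigma^2)^{-(\nu_0/2+1)} \exp\left(-\frac{1}{2\sigma^2}\left[\nu_0\sigma_0^2 + \kappa_0(\mu - \mu_0)^2\right]\right) \times \\
&\quad \times (\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{y} - \mu)^2\right]\right) \qquad (3.7) \\
&= \text{N-Inv-}\chi^2(\mu, \sigma^2 \mid \mu_n, \sigma_n^2/\kappa_n; \nu_n, \sigma_n^2),
\end{aligned}$$

where, after some algebra (see Exercise 3.9), it can be shown that

$$\begin{aligned}
\mu_n &= \frac{\kappa_0}{\kappa_0 + n}\mu_0 + \frac{n}{\kappa_0 + n}\bar{y} \\
\kappa_n &= \kappa_0 + n \\
\nu_n &= \nu_0 + n \\
\nu_n\sigma_n^2 &= \nu_0\sigma_0^2 + (n-1)s^2 + \frac{\kappa_0 n}{\kappa_0 + n}(\bar{y} - \mu_0)^2.
\end{aligned}$$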


The parameters of the posterior distribution combine the prior information and the
information contained in the data. For example, $\mu_n$ is a weighted average of the prior mean and
the sample mean, with weights determined by the relative precision of the two pieces of
information. The posterior degrees of freedom, $\nu_n$, is the prior degrees of freedom plus the
sample size. The posterior sum of squares, $\nu_n\sigma_n^2$, combines the prior sum of squares, the
sample sum of squares, and the additional uncertainty conveyed by the difference between
the sample mean and the prior mean.
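
The four updating formulas translate directly into code. The following Python function is a minimal sketch; the function name, argument names, and the numerical inputs are hypothetical and chosen only so that the arithmetic is easy to follow by hand.

```python
def normal_inv_chi2_update(mu0, kappa0, nu0, sigma0_sq, ybar, s_sq, n):
    """Posterior hyperparameters (mu_n, kappa_n, nu_n, sigma_n^2) of the N-Inv-chi^2 family."""
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    # Weighted average of the prior mean and the sample mean.
    mu_n = (kappa0 * mu0 + n * ybar) / kappa_n
    # Prior sum of squares + sample sum of squares + prior/data disagreement term.
    nu_sigma_n_sq = (nu0 * sigma0_sq
                     + (n - 1) * s_sq
                     + (kappa0 * n / kappa_n) * (ybar - mu0) ** 2)
    return mu_n, kappa_n, nu_n, nu_sigma_n_sq / nu_n


# Hypothetical inputs: one prior 'measurement' centered at 0, nine observations with mean 2.
mu_n, kappa_n, nu_n, sigma_n_sq = normal_inv_chi2_update(
    mu0=0.0, kappa0=1.0, nu0=1.0, sigma0_sq=1.0, ybar=2.0, s_sq=4.0, n=9)
```

With these inputs the posterior mean is $(1 \cdot 0 + 9 \cdot 2)/10 = 1.8$, and $\nu_n\sigma_n^2 = 1 + 32 + 0.9 \cdot 4 = 36.6$, so $\sigma_n^2 = 3.66$.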


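The conditional posterior distribution, $p(\mu \mid \sigma^2, y)$


The conditional posterior density of $\mu$, given $\sigma^2$, is proportional to the joint posterior density
(3.7) with $\sigma^2$ held constant,

$$\begin{aligned}
\mu \mid \sigma^2, y &\sim N(\mu_n, \sigma^2/\kappa_n) \\
&= N\left(\frac{\frac{\kappa_0}{\sigma^2}\mu_0 + \frac{n}{\sigma^2}\bar{y}}{\frac{\kappa_0}{\sigma^2} + \frac{n}{\sigma^2}},\ \frac{1}{\frac{\kappa_0}{\sigma^2} + \frac{n}{\sigma^2}}\right), \qquad (3.8)
\end{aligned}$$

which agrees, as it must, with the analysis in Section 2.5 of $\mu$ with $\sigma$ considered fixed.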


The marginal posterior distribution, $p(\sigma^2 \mid y)$


The marginal posterior density of $\sigma^2$, from (3.7), is scaled inverse-$\chi^2$:

$$\sigma^2 \mid y \sim \text{Inv-}\chi^2(\nu_n, \sigma_n^2). \qquad (3.9)$$


Sampling from the joint posterior distribution


To sample from the joint posterior distribution, just as in the previous section, we first draw
$\sigma^2$ from its marginal posterior distribution (3.9), then draw $\mu$ from its normal conditional
posterior distribution (3.8), using the simulated value of $\sigma^2$.


Analytic form of the marginal posterior distribution of $\mu$


Integration of the joint posterior density with respect to $\sigma^2$, in a precisely analogous way
to that used in the previous section, shows that the marginal posterior density for $\mu$ is

$$\begin{aligned}
p(\mu \mid y) &\propto \left(1 + \frac{\kappa_n(\mu - \mu_n)^2}{\nu_n\sigma_n^2}\right)^{-(\nu_n+1)/2} \\
&= t_{\nu_n}(\mu \mid \mu_n, \sigma_n^2/\kappa_n).
\end{aligned}$$
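
One way to check the analytic marginal is to compare it with the two-step simulation under the conjugate prior. The sketch below is an illustration only: the hyperparameter values, seed, and function name are hypothetical. It draws $\sigma^2$ from the scaled inverse-$\chi^2$ distribution and then $\mu$ from its conditional normal distribution, and compares the sample variance of the $\mu$ draws with the variance of a $t_{\nu_n}(\mu_n, \sigma_n^2/\kappa_n)$ distribution, which is $\frac{\nu_n}{\nu_n - 2}\,\sigma_n^2/\kappa_n$ when $\nu_n > 2$.

```python
import math
import random


def draw_mu(mu_n, kappa_n, nu_n, sigma_n_sq, rng):
    """Draw sigma2 ~ Inv-chi^2(nu_n, sigma_n^2), then mu | sigma2 ~ N(mu_n, sigma2 / kappa_n)."""
    sigma2 = nu_n * sigma_n_sq / rng.gammavariate(nu_n / 2.0, 2.0)
    return rng.gauss(mu_n, math.sqrt(sigma2 / kappa_n))


# Hypothetical posterior hyperparameters.
mu_n, kappa_n, nu_n, sigma_n_sq = 1.8, 10.0, 10.0, 3.66

rng = random.Random(7)
draws = [draw_mu(mu_n, kappa_n, nu_n, sigma_n_sq, rng) for _ in range(20000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / (len(draws) - 1)

# Variance implied by the analytic t marginal (valid for nu_n > 2).
t_var = nu_n / (nu_n - 2.0) * sigma_n_sq / kappa_n
```

The two variances should agree up to Monte Carlo error, which gives a quick sanity check on both the sampler and the algebra.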


3.4 Multinomial model for categorical data 


The binomial distribution that was emphasized in Chapter 2 can be generalized to allow 
more than two possible outcomes. The multinomial sampling distribution is used to describe 
data for which each observation is one of $k$ possible outcomes. If $y$ is the vector of counts
of the number of observations of each outcome, then

$$p(y \mid \theta) \propto \prod_{j=1}^k \theta_j^{y_j},$$

where the sum of the probabilities, $\sum_{j=1}^k \theta_j$, is 1. The distribution is typically thought of as
implicitly conditioning on the number of observations, $\sum_{j=1}^k y_j = n$. The conjugate prior
distribution is a multivariate generalization of the beta distribution known as the Dirichlet,

$$p(\theta \mid \alpha) \propto \prod_{j=1}^k \theta_j^{\alpha_j - 1},$$

where the distribution is restricted to nonnegative $\theta_j$’s with $\sum_{j=1}^k \theta_j = 1$; see Appendix
A for details. The resulting posterior distribution for the $\theta_j$’s is Dirichlet with parameters
$\alpha_j + y_j$.

The prior distribution expressed on the scale of $\alpha$ is mathematically equivalent to a
likelihood resulting from $\sum_{j=1}^k (\alpha_j - 1)$ observations with $\alpha_j - 1$ observations of the $j$th
outcome category. As in the binomial there are several plausible noninformative Dirichlet prior
distributions. A uniform density is obtained by setting $\alpha_j = 1$ for all $j$; this distribution
assigns equal density to any vector $\theta$ satisfying $\sum_{j=1}^k \theta_j = 1$. Setting $\alpha_j = 0$ for all $j$ results
in an improper prior distribution that is uniform in the $\log(\theta_j)$’s. The resulting posterior
distribution is proper if there is at least one observation in each of the k categories, so that 
each component of y is positive. The bibliographic note at the end of this chapter points 
to other suggested noninformative prior distributions for the multinomial model. 


Example. Pre-election polling

For a simple example of a multinomial model, we consider a sample survey question
with three possible responses. In late October, 1988, a survey was conducted by CBS
News of 1447 adults in the United States to find out their preferences in the upcoming
presidential election. Out of 1447 persons, $y_1 = 727$ supported George Bush, $y_2 = 583$
supported Michael Dukakis, and $y_3 = 137$ supported other candidates or expressed no
opinion. Assuming no other information on the respondents, the 1447 observations
are exchangeable. If we also assume simple random sampling (that is, 1447 names
‘drawn out of a hat’), then the data $(y_1, y_2, y_3)$ follow a multinomial distribution, with
parameters $(\theta_1, \theta_2, \theta_3)$, the proportions of Bush supporters, Dukakis supporters, and
those with no opinion in the survey population. An estimand of interest is $\theta_1 - \theta_2$,
the population difference in support for the two major candidates.

Figure 3.2 Histogram of values of $(\theta_1 - \theta_2)$ for 1000 simulations from the posterior distribution for
the election polling example.

With a noninformative uniform prior distribution on $\theta$, $\alpha_1 = \alpha_2 = \alpha_3 = 1$, the posterior
distribution for $(\theta_1, \theta_2, \theta_3)$ is Dirichlet(728, 584, 138). We could compute the posterior
distribution of $\theta_1 - \theta_2$ by integration, but it is simpler just to draw 1000 points
$(\theta_1, \theta_2, \theta_3)$ from the posterior Dirichlet distribution and then compute $\theta_1 - \theta_2$ for
each. The result is displayed in Figure 3.2. All of the 1000 simulations had $\theta_1 > \theta_2$;
thus, the estimated posterior probability that Bush had more support than Dukakis
in the survey population is over 99.9%.

In fact, the CBS survey does not use independent random sampling but rather uses a 
variant of a stratified sampling plan. We discuss an improved analysis of this survey, 
using some knowledge of the sampling scheme, in Section 8.3 (see Table 8.2 on page 
207). 
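
The simulation described in this example can be sketched with the Python standard library. The code below is an illustration, not the authors' implementation: it uses the standard construction of a Dirichlet draw as independent gamma draws divided by their sum, and the helper name and seed are hypothetical.

```python
import random


def draw_dirichlet(alphas, rng):
    """One Dirichlet(alphas) draw: independent Gamma(alpha_j, 1) draws, normalized to sum to 1."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


rng = random.Random(1988)            # arbitrary seed for reproducibility
posterior_alphas = [728, 584, 138]   # uniform prior (all alpha_j = 1) plus the observed counts

diffs = []
for _ in range(1000):
    theta = draw_dirichlet(posterior_alphas, rng)
    diffs.append(theta[0] - theta[1])   # theta_1 - theta_2 for this draw

prob_bush_ahead = sum(d > 0 for d in diffs) / len(diffs)
mean_diff = sum(diffs) / len(diffs)
```

The posterior mean of $\theta_1 - \theta_2$ under this model is $(728 - 584)/1450 \approx 0.099$, so `mean_diff` should land close to that value, and a histogram of `diffs` plays the role of Figure 3.2.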


In complicated problems—for example, analyzing the results of many survey questions 
simultaneously—the number of multinomial categories, and thus parameters, becomes so 
large that it is hard to usefully analyze a dataset of moderate size without additional 
structure in the model. Formally, additional information can enter the analysis through 
the prior distribution or the sampling model. An informative prior distribution might be 
used to improve inference in complicated problems, using the ideas of hierarchical modeling 
introduced in Chapter 5. Alternatively, loglinear models can be used to impose structure on 
multinomial parameters that result from cross-classifying several survey questions; Section 
16.7 provides details and an example. 


3.5 Multivariate normal model with known variance 


Here we give a somewhat formal account of the distributional results of Bayesian inference 
for the parameters of a multivariate normal distribution. In many ways, these results 
parallel those already given for the univariate normal model, but there are some important 
new aspects that play a major role in the analysis of linear models, which is the central 
activity of much applied statistical work (see Chapters 5, 14, and 15). This section can be 
viewed at this point as reference material for future chapters.


Multivariate normal likelihood


The basic model to be discussed concerns an observable vector $y$ of $d$ components, with the
multivariate normal distribution,

$$y \mid \mu, \Sigma \sim N(\mu, \Sigma), \qquad (3.10)$$

where $\mu$ is a (column) vector of length $d$ and $\Sigma$ is a $d \times d$ variance matrix, which is symmetric
and positive definite. The likelihood function for a single observation is

$$p(y \mid \mu, \Sigma) \propto |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(y - \mu)^T \Sigma^{-1} (y - \mu)\right),$$

and for a sample of $n$ independent and identically distributed observations, $y_1, \ldots, y_n$, is

$$p(y_1, \ldots, y_n \mid \mu, \Sigma) \propto |\Sigma|^{-n/2} \exp\left(-\frac{1}{2}\sum_{i=1}^n (y_i - \mu)^T \Sigma^{-1} (y_i - \mu)\right). \qquad (3.11)$$


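Conjugate analysis


As with the univariate normal model, we analyze the multivariate normal model by first
considering the case of known $\Sigma$.


Conjugate prior distribution for $\mu$ with known $\Sigma$. The log-likelihood is a quadratic form
in $\mu$, and therefore the conjugate prior distribution for $\mu$ is the multivariate normal
distribution, which we parameterize as $\mu \sim N(\mu_0, \Lambda_0)$.


Posterior distribution for $\mu$ with known $\Sigma$. The posterior distribution of $\mu$ is

$$p(\mu \mid y, \Sigma) \propto \exp\left(-\frac{1}{2}\left[(\mu - \mu_0)^T \Lambda_0^{-1} (\mu - \mu_0) + \sum_{i=1}^n (y_i - \mu)^T \Sigma^{-1} (y_i - \mu)\right]\right),$$

which is an exponential of a quadratic form in $\mu$. Completing the quadratic form and pulling
out constant factors (see Exercise 3.13) gives

$$\begin{aligned}
p(\mu \mid y, \Sigma) &\propto \exp\left(-\frac{1}{2}(\mu - \mu_n)^T \Lambda_n^{-1} (\mu - \mu_n)\right) \\
&= N(\mu \mid \mu_n, \Lambda_n),
\end{aligned}$$

where

$$\begin{aligned}
\mu_n &= (\Lambda_0^{-1} + n\Sigma^{-1})^{-1}(\Lambda_0^{-1}\mu_0 + n\Sigma^{-1}\bar{y}) \\
\Lambda_n^{-1} &= \Lambda_0^{-1} + n\Sigma^{-1}. \qquad (3.12)
\end{aligned}$$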


These are similar to the results for the univariate normal model in Section 2.5, the posterior
mean being a weighted average of the data and the prior mean, with weights given by the
data and prior precision matrices, $n\Sigma^{-1}$ and $\Lambda_0^{-1}$, respectively. The posterior precision is
the sum of the prior and data precisions.
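
The precision-weighted update in (3.12) can be sketched for the two-dimensional case without any external library. The helper functions and the numerical inputs below are hypothetical, written only to make the matrix arithmetic explicit; a general implementation would normally rely on a linear algebra library instead of hand-coded 2×2 inverses.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def add2(p, q):
    return [[p[i][j] + q[i][j] for j in range(2)] for i in range(2)]


def scale2(c, m):
    return [[c * m[i][j] for j in range(2)] for i in range(2)]


def matvec2(m, v):
    return [m[i][0] * v[0] + m[i][1] * v[1] for i in range(2)]


def mvn_posterior_known_sigma(mu0, lambda0, sigma, ybar, n):
    """Posterior mean and variance of mu for known Sigma, following (3.12)."""
    prior_prec = inv2(lambda0)              # Lambda_0^{-1}
    data_prec = scale2(n, inv2(sigma))      # n * Sigma^{-1}
    lambda_n = inv2(add2(prior_prec, data_prec))
    rhs = [p + q for p, q in zip(matvec2(prior_prec, mu0), matvec2(data_prec, ybar))]
    mu_n = matvec2(lambda_n, rhs)
    return mu_n, lambda_n


# Hypothetical inputs: identity prior and data variances, four observations.
identity = [[1.0, 0.0], [0.0, 1.0]]
mu_n, lambda_n = mvn_posterior_known_sigma(
    mu0=[0.0, 0.0], lambda0=identity, sigma=identity, ybar=[1.0, 2.0], n=4)
```

With these inputs the prior precision is $I$ and the data precision is $4I$, so the posterior mean is $0.8\,\bar{y} = (0.8, 1.6)$ and the posterior variance is $0.2\,I$.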


Posterior conditional and marginal distributions of subvectors of $\mu$ with known $\Sigma$. It follows
from the properties of the multivariate normal distribution (see Appendix A) that the
marginal posterior distribution of a subset of the parameters, $\mu^{(1)}$ say, is also multivariate
normal, with mean vector equal to the appropriate subvector of the posterior mean vector
$\mu_n$ and variance matrix equal to the appropriate submatrix of $\Lambda_n$. Also, the conditional
posterior distribution of a subset $\mu^{(1)}$ given the values of a second subset $\mu^{(2)}$ is multivariate
normal. If we write superscripts in parentheses to indicate appropriate subvectors and
submatrices, then

$$\mu^{(1)} \mid \mu^{(2)}, y \sim N\left(\mu_n^{(1)} + \beta^{1|2}(\mu^{(2)} - \mu_n^{(2)}),\ \Lambda^{1|2}\right), \qquad (3.13)$$

where the regression coefficients $\beta^{1|2}$ and conditional variance matrix $\Lambda^{1|2}$ are defined by

$$\begin{aligned}
\beta^{1|2} &= \Lambda_n^{(12)} \left(\Lambda_n^{(22)}\right)^{-1} \\
\Lambda^{1|2} &= \Lambda_n^{(11)} - \Lambda_n^{(12)} \left(\Lambda_n^{(22)}\right)^{-1} \Lambda_n^{(21)}.
\end{aligned}$$
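
When each subvector is a single component, the quantities in (3.13) are scalars and the formulas can be checked by hand. The following sketch covers only that bivariate case; the function name and the numbers are hypothetical.

```python
def conditional_given_second(mu_n, lambda_n, mu2_value):
    """Conditional posterior mean and variance of mu^(1) given mu^(2), bivariate case of (3.13)."""
    beta = lambda_n[0][1] / lambda_n[1][1]                           # regression coefficient
    cond_mean = mu_n[0] + beta * (mu2_value - mu_n[1])
    cond_var = lambda_n[0][0] - lambda_n[0][1] ** 2 / lambda_n[1][1]  # Lambda^{1|2}
    return cond_mean, cond_var


# Hypothetical posterior mean vector and variance matrix.
cond_mean, cond_var = conditional_given_second(
    mu_n=[0.0, 1.0], lambda_n=[[2.0, 1.0], [1.0, 4.0]], mu2_value=3.0)
```

Here the regression coefficient is $1/4$, so conditioning on $\mu^{(2)} = 3$ shifts the mean of $\mu^{(1)}$ from 0 to 0.5 and reduces its variance from 2 to 1.75.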


Posterior predictive distribution for new data. We now work out the analytic form of the
posterior predictive distribution for a new observation $\tilde{y} \sim N(\mu, \Sigma)$. As with the univariate
normal, we first note that the joint distribution, $p(\tilde{y}, \mu \mid y) = N(\tilde{y} \mid \mu, \Sigma)\, N(\mu \mid \mu_n, \Lambda_n)$, is the
exponential of a quadratic form in $(\tilde{y}, \mu)$; hence $(\tilde{y}, \mu)$ have a joint normal posterior
distribution, and so the marginal posterior distribution of $\tilde{y}$ is (multivariate) normal. We are
still assuming the variance matrix $\Sigma$ is known. As in the univariate case, we can determine
the posterior mean and variance of $\tilde{y}$ using (2.7) and (2.8):

$$\begin{aligned}
E(\tilde{y} \mid y) &= E(E(\tilde{y} \mid \mu, y) \mid y) \\
&= E(\mu \mid y) = \mu_n,
\end{aligned}$$

and

$$\begin{aligned}
\text{var}(\tilde{y} \mid y) &= E(\text{var}(\tilde{y} \mid \mu, y) \mid y) + \text{var}(E(\tilde{y} \mid \mu, y) \mid y) \\
&= E(\Sigma \mid y) + \text{var}(\mu \mid y) = \Sigma + \Lambda_n.
\end{aligned}$$


To sample from the posterior distribution or the posterior predictive distribution,
refer to Appendix A for a method of generating random draws from a multivariate normal
distribution with specified mean and variance matrix. 


Noninformative prior density for $\mu$. A noninformative uniform prior density for $\mu$ is $p(\mu) \propto$
constant, obtained in the limit as the prior precision tends to zero in the sense $|\Lambda_0^{-1}| \to 0$;
in the limit of infinite prior variance (zero prior precision), the prior mean is irrelevant.
Though this choice of prior density does not combine with the likelihood to form a proper
joint probability model for $\mu$ and $y$, the posterior density obtained by applying Bayes’ rule
is a proper posterior density. The posterior density is proportional to the likelihood (3.11),
which is an exponential of a quadratic form in $\mu$. Completing the quadratic form and pulling
out constant terms yields the posterior distribution for $\mu$, given the uniform prior density,
as $\mu \mid \Sigma, y \sim N(\bar{y}, \Sigma/n)$.


3.6 Multivariate normal with unknown mean and variance


Conjugate inverse-Wishart family of prior distributions


Recall that the conjugate distribution for the univariate normal with unknown mean and
variance is the normal-inverse-$\chi^2$ distribution (3.6). We can use the inverse-Wishart
distribution, a multivariate generalization of the scaled inverse-$\chi^2$, to describe the prior
distribution of the matrix $\Sigma$. The conjugate prior distribution for $(\mu, \Sigma)$, the
normal-inverse-Wishart, is conveniently parameterized in terms of hyperparameters $(\mu_0, \Lambda_0/\kappa_0; \nu_0, \Lambda_0)$:

$$\begin{aligned}
\Sigma &\sim \text{Inv-Wishart}_{\nu_0}(\Lambda_0^{-1}) \\
\mu \mid \Sigma &\sim N(\mu_0, \Sigma/\kappa_0),
\end{aligned}$$

which corresponds to the joint prior density

$$p(\mu, \Sigma) \propto |\Sigma|^{-((\nu_0+d)/2+1)} \exp\left(-\frac{1}{2}\text{tr}(\Lambda_0\Sigma^{-1}) - \frac{\kappa_0}{2}(\mu - \mu_0)^T \Sigma^{-1} (\mu - \mu_0)\right).$$

The parameters $\nu_0$ and $\Lambda_0$ describe the degrees of freedom and the scale matrix for the
inverse-Wishart distribution on $\Sigma$. The remaining parameters are the prior mean, $\mu_0$, and
the number of prior measurements, $\kappa_0$, on the $\Sigma$ scale. Multiplying the prior density by the
normal likelihood results in a posterior density of the same family with parameters

$$\begin{aligned}
\mu_n &= \frac{\kappa_0}{\kappa_0 + n}\mu_0 + \frac{n}{\kappa_0 + n}\bar{y} \\
\kappa_n &= \kappa_0 + n \\
\nu_n &= \nu_0 + n \\
\Lambda_n &= \Lambda_0 + S + \frac{\kappa_0 n}{\kappa_0 + n}(\bar{y} - \mu_0)(\bar{y} - \mu_0)^T,
\end{aligned}$$

where $S$ is the sum of squares matrix about the sample mean,

$$S = \sum_{i=1}^n (y_i - \bar{y})(y_i - \bar{y})^T.$$


Other results from the univariate normal easily generalize to the multivariate case. The
marginal posterior distribution of $\mu$ is multivariate $t_{\nu_n-d+1}(\mu_n, \Lambda_n/(\kappa_n(\nu_n - d + 1)))$. The
posterior predictive distribution of a new observation $\tilde{y}$ is also multivariate $t$ with an
additional factor of $\kappa_n + 1$ in the numerator of the scale matrix. Samples from the joint
posterior distribution of $(\mu, \Sigma)$ are easily obtained using the following procedure: first,
draw $\Sigma \mid y \sim \text{Inv-Wishart}_{\nu_n}(\Lambda_n^{-1})$, then draw $\mu \mid \Sigma, y \sim N(\mu_n, \Sigma/\kappa_n)$. See Appendix A for
drawing from inverse-Wishart and multivariate normal distributions. To draw from the
posterior predictive distribution of a new observation, draw $\tilde{y} \mid \mu, \Sigma, y \sim N(\mu, \Sigma)$, given the
already drawn values of $\mu$ and $\Sigma$.
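
The hyperparameter updates for the normal-inverse-Wishart family need only sums and outer products, so they can be sketched with plain Python lists. The function below is an illustration with hypothetical names and toy data; it computes the posterior hyperparameters only and does not attempt the inverse-Wishart draw itself, for which the text refers to Appendix A.

```python
def niw_update(mu0, kappa0, nu0, lambda0, ys):
    """Posterior hyperparameters (mu_n, kappa_n, nu_n, Lambda_n) of the normal-inverse-Wishart."""
    n, d = len(ys), len(mu0)
    ybar = [sum(y[j] for y in ys) / n for j in range(d)]
    # Sum of squares matrix about the sample mean.
    S = [[sum((y[i] - ybar[i]) * (y[j] - ybar[j]) for y in ys) for j in range(d)]
         for i in range(d)]
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = [(kappa0 * mu0[j] + n * ybar[j]) / kappa_n for j in range(d)]
    w = kappa0 * n / kappa_n   # weight on the prior/data disagreement term
    lambda_n = [[lambda0[i][j] + S[i][j] + w * (ybar[i] - mu0[i]) * (ybar[j] - mu0[j])
                 for j in range(d)] for i in range(d)]
    return mu_n, kappa_n, nu_n, lambda_n


# Toy data: two bivariate observations, chosen so the arithmetic is easy to verify.
mu_n, kappa_n, nu_n, lambda_n = niw_update(
    mu0=[0.0, 0.0], kappa0=2.0, nu0=4.0,
    lambda0=[[1.0, 0.0], [0.0, 1.0]],
    ys=[[1.0, 0.0], [3.0, 2.0]])
```

For these inputs $\bar{y} = (2, 1)$, every entry of $S$ equals 2, and the weight $\kappa_0 n/(\kappa_0 + n)$ equals 1, giving $\mu_n = (1, 0.5)$ and $\Lambda_n$ with rows $(7, 4)$ and $(4, 4)$.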


Different noninformative prior distributions 


Inverse-Wishart with $d+1$ degrees of freedom. Setting $\Sigma \sim \text{Inv-Wishart}_{d+1}(I)$ has the
appealing feature that each of the correlations in $\Sigma$ has, marginally, a uniform prior
distribution. (The joint distribution is not uniform, however, because of the constraint that the
correlation matrix be positive definite.)


Inverse-Wishart with $d-1$ degrees of freedom. Another proposed noninformative prior
distribution is the multivariate Jeffreys prior density,

$$p(\mu, \Sigma) \propto |\Sigma|^{-(d+1)/2},$$

which is the limit of the conjugate prior density as $\kappa_0 \to 0$, $\nu_0 \to -1$, $|\Lambda_0| \to 0$. The
corresponding posterior distribution can be written as

$$\begin{aligned}
\Sigma \mid y &\sim \text{Inv-Wishart}_{n-1}(S^{-1}) \\
\mu \mid \Sigma, y &\sim N(\bar{y}, \Sigma/n).
\end{aligned}$$

Results for the marginal distribution of $\mu$ and the posterior predictive distribution of $\tilde{y}$,
assuming that the posterior distribution is proper, follow from the previous paragraph. For
example, the marginal posterior distribution of $\mu$ is multivariate $t_{n-d}(\bar{y}, S/(n(n-d)))$.


Dose, $x_i$      Number of         Number of
(log g/ml)     animals, $n_i$    deaths, $y_i$
  -0.86             5                 0
  -0.30             5                 1
  -0.05             5                 3
   0.73             5                 5


Table 3.1: Bioassay data from Racine et al. (1986).


Scaled inverse-Wishart model


When modeling covariance matrices it can help to extend the inverse-Wishart model by
multiplying by a set of scale parameters that can be modeled separately. This gives flexibility
in modeling and allows one to set up a uniform or weak prior distribution on correlations
without overly constraining the variance parameters. The scaled inverse-Wishart model for
$\Sigma$ has the form,

$$\Sigma = \text{Diag}(\xi)\, \Sigma_\eta\, \text{Diag}(\xi),$$

where $\Sigma_\eta$ is given an inverse-Wishart prior distribution (one choice is $\text{Inv-Wishart}_{d+1}(I)$, so
that the marginal distributions of the correlations are uniform) and then the scale
parameters $\xi$ can be given weakly informative priors themselves. We discuss further in Section
15.4 in the context of varying-intercept, varying-slope hierarchical regression models.
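
The effect of the scaling can be seen in a small numerical sketch: multiplying on both sides by $\text{Diag}(\xi)$ rescales the variances but leaves the correlations of $\Sigma_\eta$ unchanged, provided the scale parameters are positive. The function names and the numbers below are hypothetical.

```python
import math


def scale_covariance(xi, sigma_eta):
    """Compute Diag(xi) * Sigma_eta * Diag(xi) elementwise: entry (i, j) is xi_i * Sigma_eta_ij * xi_j."""
    d = len(xi)
    return [[xi[i] * sigma_eta[i][j] * xi[j] for j in range(d)] for i in range(d)]


def correlation(m, i, j):
    """Correlation implied by covariance matrix m between components i and j."""
    return m[i][j] / math.sqrt(m[i][i] * m[j][j])


sigma_eta = [[1.0, 0.5], [0.5, 4.0]]   # hypothetical unscaled covariance matrix
xi = [2.0, 3.0]                        # hypothetical positive scale parameters
sigma = scale_covariance(xi, sigma_eta)
```

In this example the variances change from (1, 4) to (4, 36) while the correlation stays at 0.25, which is why the prior on correlations and the prior on scales can be specified separately.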


3.7 Example: analysis of a bioassay experiment 


Beyond the normal distribution, few multiparameter sampling models allow simple explicit 
calculation of posterior distributions. Data analysis for such models is possible using the 
computational methods described in Part III of this book. Here we present an example 
of a nonconjugate model for a bioassay experiment, drawn from the literature on applied 
Bayesian statistics. The model is a two-parameter example from the broad class of
generalized linear models to be considered more thoroughly in Chapter 16. We use a particularly
simple simulation approach, approximating the posterior distribution by a discrete
distribution supported on a two-dimensional grid of points, that provides sufficiently accurate
inferences for this two-parameter example. 


The scientific problem and the data 


In the development of drugs and other chemical compounds, acute toxicity tests or bioassay
experiments are commonly performed on animals. Such experiments proceed by
administering various dose levels of the compound to batches of animals. The animals’ responses
are typically characterized by a dichotomous outcome: for example, alive or dead, tumor
or no tumor. An experiment of this kind gives rise to data of the form

$$(x_i, n_i, y_i);\ i = 1, \ldots, k,$$

where $x_i$ represents the $i$th of $k$ dose levels (often measured on a logarithmic scale) given
to $n_i$ animals, of which $y_i$ subsequently respond with positive outcome. An example of real
data from such an experiment is shown in Table 3.1: twenty animals were tested, five at
each of four dose levels.


Modeling the dose-response relation 


Given what we have seen so far, we must model the outcomes of the five animals within 
each group i as exchangeable, and it seems reasonable to model them as independent with 
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equal probabilities, which implies that the data points y; are binomially distributed: 


yilði ~ Bin(nj, 0:), 


where 6; is the probability of death for animals given dose x;. (An example of a situation 
in which independence and the binomial model would not be appropriate is if the deaths 
were caused by a contagious disease.) For this experiment, it is also reasonable to treat the 
outcomes in the four groups as independent of each other, given the parameters 6),..., 04. 

The simplest analysis would treat the four parameters $\theta_i$ as exchangeable in their prior
distribution, perhaps using a noninformative density such as $p(\theta_1, \ldots, \theta_4) \propto 1$, in which case
the parameters $\theta_i$ would have independent beta posterior distributions. The exchangeable
prior model for the $\theta_i$ parameters has a serious flaw, however; we know the dose level $x_i$
for each group $i$, and one would expect the probability of death to vary systematically as a
function of dose.
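To make the independent beta posteriors concrete, here is a minimal Python sketch. It assumes the Table 3.1 values (log doses -0.86, -0.30, -0.05, 0.73; five animals per group; 0, 1, 3, 5 deaths), and the function names are illustrative rather than taken from the text. Under the uniform prior, each $\theta_i$ has a $\mathrm{Beta}(y_i + 1, n_i - y_i + 1)$ posterior, and nothing in the calculation uses the dose $x_i$.

```python
import random

# Bioassay data, assumed to match Table 3.1 (check against the table).
x = [-0.86, -0.30, -0.05, 0.73]   # log dose (not used by this model)
n = [5, 5, 5, 5]                  # animals per group
y = [0, 1, 3, 5]                  # deaths per group


def beta_posterior_params(y_i, n_i):
    """Uniform prior on theta_i gives theta_i | y_i ~ Beta(y_i + 1, n_i - y_i + 1)."""
    return y_i + 1, n_i - y_i + 1


def posterior_means(y, n):
    """Posterior mean of each theta_i under independent uniform priors."""
    means = []
    for y_i, n_i in zip(y, n):
        a, b = beta_posterior_params(y_i, n_i)
        means.append(a / (a + b))
    return means


def posterior_draws(y, n, size=1000, seed=0):
    """Independent posterior draws, one list of length `size` per group."""
    rng = random.Random(seed)
    return [
        [rng.betavariate(*beta_posterior_params(y_i, n_i)) for _ in range(size)]
        for y_i, n_i in zip(y, n)
    ]


means = posterior_means(y, n)
draws = posterior_draws(y, n)
```

The posterior means are 1/7, 2/7, 4/7, and 6/7. Each group is estimated from its own five animals only, which is the flaw noted above: the ordering of the doses plays no role.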

The simplest model of the dose-response relation (that is, the relation of $\theta_i$ to $x_i$) is
linear: $\theta_i = \alpha + \beta x_i$. Unfortunately, this model has the flaw that at low or high doses,
$x_i$ approaches $\pm\infty$ (recall that the dose is measured on the log scale), whereas $\theta_i$, being a
probability, must be constrained to lie between 0 and 1. The standard solution is to use a
transformation of the $\theta$'s, such as the logistic, in the dose-response relation:


$$\mathrm{logit}(\theta_i) = \alpha + \beta x_i, \tag{3.14}$$

where $\mathrm{logit}(\theta_i) = \log(\theta_i/(1 - \theta_i))$ as defined in (1.10). This is called a logistic regression
model.


The likelihood 


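Under the model (3.14), we can write the sampling distribution, or likelihood, for each
group $i$ in terms of the parameters $\alpha$ and $\beta$ as


$$p(y_i \mid \alpha, \beta, n_i, x_i) \propto [\mathrm{logit}^{-1}(\alpha + \beta x_i)]^{y_i} [1 - \mathrm{logit}^{-1}(\alpha + \beta x_i)]^{n_i - y_i}.$$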


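The model is characterized by the parameters $\alpha$ and $\beta$, whose joint posterior distribution
is


$$\begin{aligned}
p(\alpha, \beta \mid y, n, x) &\propto p(\alpha, \beta \mid n, x)\, p(y \mid \alpha, \beta, n, x) \\
&\propto p(\alpha, \beta) \prod_{i=1}^{k} p(y_i \mid \alpha, \beta, n_i, x_i).
\end{aligned} \tag{3.15}$$


We consider the sample sizes $n_i$ and dose levels $x_i$ as fixed for this analysis and suppress
the conditioning on $(n, x)$ in subsequent notation.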
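As a sketch of how the likelihood and (3.15) translate into code, the Python function below evaluates the unnormalized log posterior under a flat prior, $p(\alpha, \beta) \propto 1$. The data values are assumed to be those of Table 3.1, and the names `log1pexp` and `log_posterior` are illustrative. The code uses the identity $y_i \log \theta_i + (n_i - y_i) \log(1 - \theta_i) = y_i \eta_i - n_i \log(1 + e^{\eta_i})$, with $\eta_i = \alpha + \beta x_i$, to avoid forming probabilities that round to 0 or 1.

```python
import math

# Bioassay data, assumed to match Table 3.1 (check against the table).
x = [-0.86, -0.30, -0.05, 0.73]   # log dose
n = [5, 5, 5, 5]                  # animals per group
y = [0, 1, 3, 5]                  # deaths per group


def log1pexp(t):
    """Numerically stable log(1 + exp(t))."""
    if t > 0:
        return t + math.log1p(math.exp(-t))
    return math.log1p(math.exp(t))


def log_posterior(alpha, beta):
    """Unnormalized log posterior for (alpha, beta) with a flat prior.

    Each group contributes y_i * eta_i - n_i * log(1 + exp(eta_i)),
    where eta_i = alpha + beta * x_i is the linear predictor.
    """
    total = 0.0
    for x_i, n_i, y_i in zip(x, n, y):
        eta = alpha + beta * x_i
        total += y_i * eta - n_i * log1pexp(eta)
    return total


value_at_origin = log_posterior(0.0, 0.0)   # equals -20 * log(2)
```

Working on the log scale from the start also prepares for the grid computation later in the section, where the density is exponentiated only after subtracting its maximum.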


The prior distribution 


We present an analysis based on a prior distribution for $(\alpha, \beta)$ that is independent and locally
uniform in the two parameters; that is, $p(\alpha, \beta) \propto 1$. In practice, we might use a uniform
prior distribution if we really have no prior knowledge about the parameters, or if we want to
present a simple analysis of this experiment alone. If the analysis using the noninformative
prior distribution is insufficiently precise, we may consider using other sources of substantive
information (for example, from other bioassay experiments) to construct an informative
prior distribution.

Figure 3.3 (a) Contour plot for the posterior density of the parameters in the bioassay example.
Contour lines are at 0.05, 0.15,..., 0.95 times the density at the mode. (b) Scatterplot of 1000 draws
from the posterior distribution.


A rough estimate of the parameters 


We will compute the joint posterior distribution (3.15) at a grid of points $(\alpha, \beta)$, but before
doing so, it is a good idea to get a rough estimate of $(\alpha, \beta)$ so we know where to look. To
obtain the rough estimate, we use existing software to perform a logistic regression; that
is, finding the maximum likelihood estimate of $(\alpha, \beta)$ in (3.15) for the four data points in
Table 3.1. The estimate is $(\hat\alpha, \hat\beta) = (0.8, 7.7)$, with standard errors of 1.0 and 4.9 for $\alpha$ and
$\beta$, respectively.
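The text relies on existing software for this step. As a hedged stand-in, the sketch below finds the maximum likelihood estimate by Newton's method and takes standard errors from the inverse of the observed information. The data are assumed to be those of Table 3.1, the starting point (0, 0) is an arbitrary choice, and all function names are illustrative.

```python
import math

# Bioassay data, assumed to match Table 3.1 (check against the table).
x = [-0.86, -0.30, -0.05, 0.73]   # log dose
n = [5, 5, 5, 5]                  # animals per group
y = [0, 1, 3, 5]                  # deaths per group


def inv_logit(t):
    """Numerically stable inverse logit."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def score_and_information(alpha, beta):
    """Gradient of the log-likelihood and observed information (negative Hessian)."""
    g_a = g_b = i_aa = i_ab = i_bb = 0.0
    for x_i, n_i, y_i in zip(x, n, y):
        p = inv_logit(alpha + beta * x_i)
        resid = y_i - n_i * p            # observed minus expected deaths
        w = n_i * p * (1.0 - p)          # binomial variance
        g_a += resid
        g_b += resid * x_i
        i_aa += w
        i_ab += w * x_i
        i_bb += w * x_i * x_i
    return (g_a, g_b), (i_aa, i_ab, i_bb)


def fit_logistic(max_iter=50, tol=1e-10):
    """Newton iterations from (0, 0); returns estimates and standard errors."""
    alpha, beta = 0.0, 0.0
    for _ in range(max_iter):
        (g_a, g_b), (i_aa, i_ab, i_bb) = score_and_information(alpha, beta)
        det = i_aa * i_bb - i_ab ** 2
        # Newton step: inverse information times the score (2 x 2 inverse by hand).
        step_a = (i_bb * g_a - i_ab * g_b) / det
        step_b = (i_aa * g_b - i_ab * g_a) / det
        alpha, beta = alpha + step_a, beta + step_b
        if abs(step_a) + abs(step_b) < tol:
            break
    _, (i_aa, i_ab, i_bb) = score_and_information(alpha, beta)
    det = i_aa * i_bb - i_ab ** 2
    return alpha, beta, math.sqrt(i_bb / det), math.sqrt(i_aa / det)


alpha_hat, beta_hat, se_alpha, se_beta = fit_logistic()
```

With the assumed data, the result should land close to the rough estimate quoted above, which is all that is needed to decide where to place the grid.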


Obtaining a contour plot of the joint posterior density 


We are now ready to compute the posterior density at a grid of points $(\alpha, \beta)$. After some
experimentation, we use the range $(\alpha, \beta) \in [-5, 10] \times [-10, 40]$, which captures almost all
the mass of the posterior distribution. The resulting contour plot appears in Figure 3.3a;
a general justification for setting the lowest contour level at 0.05 for two-dimensional plots
appears in Section 4.1.


Sampling from the joint posterior distribution 


Having computed the unnormalized posterior density at a grid of values that cover the
effective range of $(\alpha, \beta)$, we can normalize by approximating the distribution as a step
function over the grid and setting the total probability in the grid to 1. We sample 1000
random draws $(\alpha^s, \beta^s)$ from the posterior distribution using the following procedure.


1. Compute the marginal posterior distribution of $\alpha$ by numerically summing over $\beta$ in the
discrete distribution computed on the grid of Figure 3.3a.
2. For $s = 1, \ldots, 1000$:
(a) Draw $\alpha^s$ from the discretely computed $p(\alpha \mid y)$; this can be viewed as a discrete version
of the inverse cdf method described in Section 1.9.
(b) Draw $\beta^s$ from the discrete conditional distribution, $p(\beta \mid \alpha, y)$, given the just-sampled
value of $\alpha$.
(c) For each of the sampled $\alpha$ and $\beta$, add a uniform random jitter centered at zero with
a width equal to the spacing of the sampling grid. This gives the simulation draws a
continuous distribution.
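The procedure above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the data are assumed to be those of Table 3.1, the grid has 100 points per axis (the text does not give its resolution), and the helper names are invented for the example. Sampling needs only relative weights, so the explicit normalization is folded into the discrete inverse-cdf draw.

```python
import bisect
import math
import random
from itertools import accumulate

# Bioassay data, assumed to match Table 3.1 (check against the table).
x = [-0.86, -0.30, -0.05, 0.73]
n = [5, 5, 5, 5]
y = [0, 1, 3, 5]


def log_post(alpha, beta):
    """Unnormalized log posterior under a flat prior on (alpha, beta)."""
    total = 0.0
    for x_i, n_i, y_i in zip(x, n, y):
        eta = alpha + beta * x_i
        log1pexp = eta + math.log1p(math.exp(-eta)) if eta > 0 else math.log1p(math.exp(eta))
        total += y_i * eta - n_i * log1pexp
    return total


def make_grid(lo, hi, m):
    """m equally spaced points from lo to hi, plus the spacing."""
    step = (hi - lo) / (m - 1)
    return [lo + i * step for i in range(m)], step


def draw_index(weights, rng):
    """Discrete inverse-cdf draw: index i with probability weights[i] / sum(weights)."""
    cdf = list(accumulate(weights))
    u = rng.random() * cdf[-1]
    return min(bisect.bisect_right(cdf, u), len(weights) - 1)


def sample_posterior(n_draws=1000, m=100, seed=1):
    rng = random.Random(seed)
    alphas, d_alpha = make_grid(-5.0, 10.0, m)
    betas, d_beta = make_grid(-10.0, 40.0, m)
    logp = [[log_post(a, b) for b in betas] for a in alphas]
    top = max(max(row) for row in logp)
    p = [[math.exp(v - top) for v in row] for row in logp]   # largest value is 1
    marginal_alpha = [sum(row) for row in p]                 # step 1: sum over beta
    draws = []
    for _ in range(n_draws):
        i = draw_index(marginal_alpha, rng)                  # step 2(a)
        j = draw_index(p[i], rng)                            # step 2(b): beta given alpha
        a = alphas[i] + (rng.random() - 0.5) * d_alpha       # step 2(c): jitter
        b = betas[j] + (rng.random() - 0.5) * d_beta
        draws.append((a, b))
    return draws


draws = sample_posterior()
```

A scatterplot of `draws` plays the role of Figure 3.3b. The jitter is what turns the grid-valued draws into draws from a continuous (piecewise-uniform) approximation.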


The 1000 draws $(\alpha^s, \beta^s)$ are displayed on a scatterplot in Figure 3.3b. The scale of the
plot, which is the same as the scale of Figure 3.3a, has been set large enough that all the
1000 draws would fit on the graph.

Figure 3.4 Histogram of the draws from the posterior distribution of the LD50 (on the scale of log
dose in g/ml) in the bioassay example, conditional on the parameter $\beta$ being positive.

There are a number of practical considerations when applying this two-dimensional grid 
approximation. There can be difficulty finding the correct location and scale for the grid 
points. A grid that is defined on too small an area may miss important features of the 
posterior distribution that fall outside the grid. A grid defined on a large area with wide 
intervals between points can miss important features that fall between the grid points. It 
is also important to avoid overflow and underflow operations when computing the poste- 
rior distribution. It is usually a good idea to compute the logarithm of the unnormalized 
posterior distribution and subtract off the maximum value before exponentiating. This 
creates an unnormalized discrete approximation with maximum value 1, which can then be 
normalized (by setting the total probability in the grid to 1). 
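The overflow and underflow advice can be illustrated with a short Python sketch. The log-density values are hypothetical, chosen so that naive exponentiation underflows to zero, and the function name is illustrative.

```python
import math


def normalize_log_weights(log_w):
    """Return probabilities proportional to exp(log_w), computed stably.

    Subtracting the maximum makes the largest unnormalized weight exactly 1,
    so at least one term survives exponentiation.
    """
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


# Hypothetical unnormalized log-posterior values at three grid points.
log_w = [-1500.0, -1501.0, -1503.0]

naive = [math.exp(v) for v in log_w]      # every entry underflows to 0.0
probs = normalize_log_weights(log_w)      # well-defined probabilities
```

For a two-dimensional grid the same function can be applied to the flattened list of log-density values. The naive version fails here because the sum of the weights is zero and cannot be divided by.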


The posterior distribution of the LD50 


A parameter of common interest in bioassay studies is the LD50—the dose level at which 
the probability of death is 50%. In our logistic model, a 50% survival rate means 


$$\mathrm{LD50:} \quad \mathrm{E}\left(\frac{y_i}{n_i}\right) = \mathrm{logit}^{-1}(\alpha + \beta x_i) = 0.5;$$


thus, $\alpha + \beta x_i = \mathrm{logit}(0.5) = 0$, and the LD50 is $x_i = -\alpha/\beta$. Computing the posterior
distribution of any summaries in the Bayesian approach is straightforward, as discussed at
the end of Section 1.9. Given what we have done so far, simulating the posterior distribution
of the LD50 is trivial: we just compute $-\alpha/\beta$ for the 1000 draws of $(\alpha, \beta)$ pictured in Figure
3.3b.


Difficulties with the LD50 parameterization if the drug is beneficial. In the context of this
example, LD50 is a meaningless concept if $\beta \le 0$, in which case increasing the dose does not
cause the probability of death to increase. If we were certain that the drug could not cause
the tumor rate to decrease, we should constrain the parameter space to exclude values of $\beta$
less than 0. However, it seems more reasonable here to allow the possibility of $\beta \le 0$ and
just note that LD50 is hard to interpret in this case.

We summarize the inference on the LD50 scale by reporting two results: (1) the posterior
probability that $\beta > 0$ (that is, that the drug is harmful) and (2) the posterior distribution
for the LD50 conditional on $\beta > 0$. All of the 1000 simulation draws had positive values of
$\beta$, so the posterior probability that $\beta > 0$ is roughly estimated to exceed 0.999. We compute
the LD50 for the simulation draws with positive values of $\beta$ (which happen to be all 1000
draws for this example); a histogram is displayed in Figure 3.4. This example illustrates that
the marginal posterior mean is not always a good summary of inference about a parameter.
We are not, in general, interested in the posterior mean of the LD50, because the posterior
mean includes the cases in which the dose-response relation is negative.
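The two-part summary can be sketched in Python as below. The four draws are hypothetical values chosen to exercise both branches, not output from the analysis, and the function name is illustrative.

```python
import statistics


def summarize_ld50(draws):
    """Split the LD50 summary into the two results described in the text.

    draws: list of (alpha, beta) posterior simulations.
    Returns the estimated Pr(beta > 0) and the LD50 values, -alpha / beta,
    computed only for draws with beta > 0.
    """
    positive = [(a, b) for a, b in draws if b > 0]
    prob_harmful = len(positive) / len(draws)
    ld50 = [-a / b for a, b in positive]
    return prob_harmful, ld50


# Hypothetical draws for illustration; the last one has a negative slope.
example_draws = [(0.8, 7.7), (1.5, 12.0), (-0.2, 4.0), (0.3, -1.0)]

prob_harmful, ld50 = summarize_ld50(example_draws)
ld50_median = statistics.median(ld50)
```

Reporting the probability and the conditional distribution separately avoids averaging LD50 values over draws for which the quantity has no interpretation.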


3.8 Summary of elementary modeling and computation 


The lack of multiparameter models permitting easy calculation of posterior distributions is 
not a major practical handicap for three main reasons. First, when there are few parame- 
ters, posterior inference in nonconjugate multiparameter models can be obtained by simple 
simulation methods, as we have seen in the bioassay example. Second, sophisticated models 
can often be represented in a hierarchical or conditional manner, as we shall see in Chapter 
5, for which effective computational strategies are available (as we discuss in general in Part 
III). Finally, as we discuss in Chapter 4, we can often apply a normal approximation to 
the posterior distribution, and therefore the conjugate structure of the normal model can 
play an important role in practice, well beyond its application to explicitly normal sampling 
models. 

Our successful analysis of the bioassay example suggests the following strategy for com- 
putation of simple Bayesian posterior distributions. What follows is not truly a general ap- 
proach, but it summarizes what we have done so far and foreshadows the general methods— 
based on successive approximations—presented in Part III. 


1. Write the likelihood part of the model, $p(y \mid \theta)$, ignoring any factors that are free of $\theta$.

2. Write the posterior density, $p(\theta \mid y) \propto p(\theta) p(y \mid \theta)$. If prior information is well-formulated,
include it in $p(\theta)$. Otherwise use a weakly informative prior distribution or temporarily
set $p(\theta) \propto$ constant, with the understanding that the prior density can be altered later
to include additional information or structure.

3. Create a crude estimate of the parameters, $\theta$, for use as a starting point and a comparison
to the computation in the next step.

4. Draw simulations $\theta^1, \ldots, \theta^S$, from the posterior distribution. Use the sample draws to
compute the posterior density of any functions of $\theta$ that may be of interest.

5. If any predictive quantities, $\tilde{y}$, are of interest, simulate $\tilde{y}^1, \ldots, \tilde{y}^S$ by drawing each $\tilde{y}^s$
from the sampling distribution conditional on the drawn value $\theta^s$, $p(\tilde{y} \mid \theta^s)$. In Chapter
6, we discuss how to use posterior simulations of $\theta$ and $\tilde{y}$ to check the fit of the model to
data and substantive knowledge.


For nonconjugate models, step 4 above can be difficult. Various methods have been
developed to draw posterior simulations in complicated models, as we discuss in Part III.
Occasionally, high-dimensional problems can be solved by combining analytical and numerical
simulation methods. If $\theta$ has only one or two components, it is possible to draw
simulations by computing on a grid, as we illustrated in the previous section for the bioassay
example.


3.9 Bibliographic note 


Chapter 2 of Box and Tiao (1973) thoroughly treats the univariate and multivariate normal 
distribution problems and also some related problems such as estimating the difference 
between two means and the ratio between two variances. At the time that book was 
written, computer simulation methods were much less convenient than they are now, and 
so Box and Tiao, and other Bayesian authors of the period, restricted their attention to 
conjugate families and devoted much effort to deriving analytic forms of marginal posterior 
densities. 

Many textbooks on multivariate analysis discuss the unique mathematical features of
the multivariate normal distribution, such as the property that all marginal and conditional
distributions of components of a multivariate normal vector are normal; for example, see
Mardia, Kent, and Bibby (1979).

Survey         Bush    Dukakis    No opinion/other    Total
pre-debate     294     307        38                  639
post-debate    288     332        19                  639

Table 3.2 Number of respondents in each preference category from ABC News pre- and post-debate
surveys in 1988.

Simon Newcomb’s data, along with a discussion of his experiment, appear in Stigler 
(1977). 

The multinomial model and corresponding informative and noninformative prior distri- 
butions are discussed by Good (1965) and Fienberg (1977); also see the bibliographic note 
on loglinear models at the end of Chapter 16. 

The data and model for the bioassay example appear in Racine et al. (1986), an article 
that presents several examples of simple Bayesian analyses that have been useful in the 
pharmaceutical industry. 


3.10 Exercises 


1. Binomial and multinomial models: suppose data $(y_1, \ldots, y_J)$ follow a multinomial distribution
with parameters $(\theta_1, \ldots, \theta_J)$. Also suppose that $\theta = (\theta_1, \ldots, \theta_J)$ has a Dirichlet
prior distribution. Let $\alpha = \frac{\theta_1}{\theta_1 + \theta_2}$.


(a) Write the marginal posterior distribution for $\alpha$.


(b) Show that this distribution is identical to the posterior distribution for $\alpha$ obtained by
treating $y_1$ as an observation from the binomial distribution with probability $\alpha$ and
sample size $y_1 + y_2$, ignoring the data $y_3, \ldots, y_J$.


This result justifies the application of the binomial distribution to multinomial problems 
when we are only interested in two of the categories; for example, see the next problem. 


2. Comparison of two multinomial observations: on September 25, 1988, the evening of a 
presidential campaign debate, ABC News conducted a survey of registered voters in the 
United States; 639 persons were polled before the debate, and 639 different persons were 
polled after. The results are displayed in Table 3.2. Assume the surveys are independent 
simple random samples from the population of registered voters. Model the data with 
two different multinomial distributions. For $j = 1, 2$, let $\alpha_j$ be the proportion of voters
who preferred Bush, out of those who had a preference for either Bush or Dukakis at
the time of survey $j$. Plot a histogram of the posterior density for $\alpha_2 - \alpha_1$. What is the
posterior probability that there was a shift toward Bush?


3. Estimation from two independent experiments: an experiment was performed on the 
effects of magnetic fields on the flow of calcium out of chicken brains. Two groups 
of chickens were involved: a control group of 32 chickens and an exposed group of 36 
chickens. One measurement was taken on each chicken, and the purpose of the experiment 
was to measure the average flow $\mu_c$ in untreated (control) chickens and the average flow
$\mu_t$ in treated chickens. The 32 measurements on the control group had a sample mean of
1.013 and a sample standard deviation of 0.24. The 36 measurements on the treatment
group had a sample mean of 1.173 and a sample standard deviation of 0.20.


(a) Assuming the control measurements were taken at random from a normal distribution
with mean $\mu_c$ and variance $\sigma_c^2$, what is the posterior distribution of $\mu_c$? Similarly, use
the treatment group measurements to determine the marginal posterior distribution
of $\mu_t$. Assume a uniform prior distribution on $(\mu_c, \mu_t, \log \sigma_c, \log \sigma_t)$.

(b) What is the posterior distribution for the difference, $\mu_t - \mu_c$? To get this, you may
sample from the independent $t$ distributions you obtained in part (a) above. Plot a
histogram of your samples and give an approximate 95% posterior interval for $\mu_t - \mu_c$.


The problem of estimating two normal means with unknown ratio of variances is called
the Behrens-Fisher problem.


4. Inference for a 2 x 2 table: an experiment was performed to estimate the effect of beta- 
blockers on mortality of cardiac patients. A group of patients were randomly assigned 
to treatment and control groups: out of 674 patients receiving the control, 39 died, and 
out of 680 receiving the treatment, 22 died. Assume that the outcomes are independent 
and binomially distributed, with probabilities of death of $p_0$ and $p_1$ under the control
and treatment, respectively. We return to this example in Section 5.6.


(a) Set up a noninformative prior distribution on $(p_0, p_1)$ and obtain posterior simulations.
(b) Summarize the posterior distribution for the odds ratio, $(p_1/(1 - p_1))/(p_0/(1 - p_0))$.
(c) Discuss the sensitivity of your inference to your choice of noninformative prior density.


5. Rounded data: it is a common problem for measurements to be observed in rounded 
form (for a review, see Heitjan, 1989). For a simple example, suppose we weigh an 
object five times and measure weights, rounded to the nearest pound, of 10, 10, 12, 11, 
9. Assume the unrounded measurements are normally distributed with a noninformative
prior distribution on the mean $\mu$ and variance $\sigma^2$.


(a) Give the posterior distribution for $(\mu, \sigma^2)$ obtained by pretending that the observations
are exact unrounded measurements.

(b) Give the correct posterior distribution for $(\mu, \sigma^2)$ treating the measurements as rounded.

(c) How do the incorrect and correct posterior distributions differ? Compare means,
variances, and contour plots.

(d) Let $z = (z_1, \ldots, z_5)$ be the original, unrounded measurements corresponding to the five
observations above. Draw simulations from the posterior distribution of $z$. Compute
the posterior mean of $(z_1 - z_2)^2$.


6. Binomial with unknown probability and sample size: some of the difficulties with setting
prior distributions in multiparameter models can be illustrated with the simple binomial
distribution. Consider data $y_1, \ldots, y_n$ modeled as independent $\mathrm{Bin}(N, \theta)$, with both $N$
and $\theta$ unknown. Defining a convenient family of prior distributions on $(N, \theta)$ is difficult,
partly because of the discreteness of $N$.
Raftery (1988) considers a hierarchical approach based on assigning the parameter $N$
a Poisson distribution with unknown mean $\mu$. To define a prior distribution on $(\theta, N)$,
Raftery defines $\lambda = \mu\theta$ and specifies a prior distribution on $(\lambda, \theta)$. The prior distribution
is specified in terms of $\lambda$ rather than $\mu$ because ‘it would seem easier to formulate prior
information about $\lambda$, the unconditional expectation of the observations, than about $\mu$,
the mean of the unobserved quantity $N$.’


(a) A suggested noninformative prior distribution is $p(\lambda, \theta) \propto \lambda^{-1}$. What is a motivation
for this noninformative distribution? Is the distribution improper? Transform to
determine $p(N, \theta)$.

(b) The Bayesian method is illustrated on counts of waterbuck obtained by remote photography
on five separate days in Kruger Park in South Africa. The counts were
53, 57, 66, 67, and 72. Perform the Bayesian analysis on these data and display a
scatterplot of posterior simulations of $(N, \theta)$. What is the posterior probability that
$N > 100$?


(c) Why not simply use a Poisson with fixed $\mu$ as a prior distribution for $N$?

Type of       Bike      Counts of bicycles/other vehicles
street        route?
Residential   yes       16/58, 9/90, 10/48, 13/57, 19/103,
                        20/57, 18/86, 17/112, 35/273, 55/64
Residential   no        12/113, 1/18, 2/14, 4/44, 9/208,
                        7/67, 9/29, 8/154
Fairly busy   yes       8/29, 35/415, 31/425, 19/42, 38/180,
                        47/675, 44/620, 44/437, 29/47, 18/462
Fairly busy   no        10/557, 43/1258, 5/499, 14/601, 58/1163,
                        15/700, 0/90, 47/1093, 51/1459, 32/1086
Busy          yes       60/1545, 51/1499, 58/1598, 59/503, 53/407,
                        68/1494, 68/1558, 60/1706, 71/476, 63/752
Busy          no        8/1248, 9/1246, 6/1596, 9/1765, 19/1290,
                        61/2498, 31/2346, 75/3101, 14/1918, 25/2318


Table 3.3 Counts of bicycles and other vehicles in one hour in each of 10 city blocks in each of
six categories. (The data for two of the residential blocks were lost.) For example, the first block
had 16 bicycles and 58 other vehicles, the second had 9 bicycles and 90 other vehicles, and so on.
Streets were classified as ‘residential,’ ‘fairly busy,’ or ‘busy’ before the data were gathered.


7. Poisson and binomial distributions: a student sits on a street corner for an hour and
records the number of bicycles $b$ and the number of other vehicles $v$ that go by. Two
models are considered:


• The outcomes $b$ and $v$ have independent Poisson distributions, with unknown means
$\theta_b$ and $\theta_v$.

• The outcome $b$ has a binomial distribution, with unknown probability $p$ and sample
size $b + v$.

Show that the two models have the same likelihood if we define $p = \frac{\theta_b}{\theta_b + \theta_v}$.


8. Analysis of proportions: a survey was done of bicycle and other vehicular traffic in the 
neighborhood of the campus of the University of California, Berkeley, in the spring of 
1993. Sixty city blocks were selected at random; each block was observed for one hour, 
and the numbers of bicycles and other vehicles traveling along that block were recorded. 
The sampling was stratified into six types of city blocks: busy, fairly busy, and residential 
streets, with and without bike routes, with ten blocks measured in each stratum. Table 
3.3 displays the number of bicycles and other vehicles recorded in the study. For this 
problem, restrict your attention to the first four rows of the table: the data on residential 
streets. 


(a) Let $y_1, \ldots, y_{10}$ and $z_1, \ldots, z_8$ be the observed proportion of traffic that was on bicycles
in the residential streets with bike lanes and with no bike lanes, respectively (so
$y_1 = 16/(16 + 58)$ and $z_1 = 12/(12 + 113)$, for example). Set up a model so that the
$y_i$’s are independent and identically distributed given parameters $\theta_y$ and the $z_i$’s are
independent and identically distributed given parameters $\theta_z$.

(b) Set up a prior distribution that is independent in $\theta_y$ and $\theta_z$.

(c) Determine the posterior distribution for the parameters in your model and draw 1000
simulations from the posterior distribution. (Hint: $\theta_y$ and $\theta_z$ are independent in the
posterior distribution, so they can be simulated independently.)

(d) Let $\mu_y = \mathrm{E}(y_i \mid \theta_y)$ be the mean of the distribution of the $y_i$’s; $\mu_y$ will be a function of
$\theta_y$. Similarly, define $\mu_z$. Using your posterior simulations from (c), plot a histogram of
the posterior simulations of $\mu_y - \mu_z$, the expected difference in proportions in bicycle
traffic on residential streets with and without bike lanes.


We return to this example in Exercise 5.13.

9. Conjugate normal model: suppose $y$ is an independent and identically distributed sample
of size $n$ from the distribution $\mathrm{N}(\mu, \sigma^2)$, where the prior distribution for $(\mu, \sigma^2)$ is
$\text{N-Inv-}\chi^2(\mu, \sigma^2 \mid \mu_0, \sigma_0^2/\kappa_0; \nu_0, \sigma_0^2)$; that is, $\sigma^2 \sim \text{Inv-}\chi^2(\nu_0, \sigma_0^2)$ and $\mu \mid \sigma^2 \sim \mathrm{N}(\mu_0, \sigma^2/\kappa_0)$.
The posterior distribution, $p(\mu, \sigma^2 \mid y)$, is also normal-inverse-$\chi^2$; derive explicitly its
parameters in terms of the prior parameters and the sufficient statistics of the data.


10. Comparison of normal variances: for $j = 1, 2$, suppose that


$$\begin{aligned}
y_{j1}, \ldots, y_{jn_j} \mid \mu_j, \sigma_j^2 &\sim \text{iid } \mathrm{N}(\mu_j, \sigma_j^2), \\
p(\mu_j, \sigma_j^2) &\propto \sigma_j^{-2},
\end{aligned}$$

and $(\mu_1, \sigma_1^2)$ are independent of $(\mu_2, \sigma_2^2)$ in the prior distribution. Show that the posterior
distribution of $(s_1^2/s_2^2)/(\sigma_1^2/\sigma_2^2)$ is $F$ with $(n_1 - 1)$ and $(n_2 - 1)$ degrees of freedom. (Hint:
to show the required form of the posterior density, you do not need to carry along all the
normalizing constants.)

11. Computation: in the bioassay example, replace the uniform prior density by a joint normal
prior distribution on $(\alpha, \beta)$, with $\alpha \sim \mathrm{N}(0, 2^2)$, $\beta \sim \mathrm{N}(10, 10^2)$, and $\mathrm{corr}(\alpha, \beta) = 0.5$.

(a) Repeat all the computations and plots of Section 3.7 with this new prior distribution.

(b) Check that your contour plot and scatterplot look like a compromise between the prior
distribution and the likelihood (as displayed in Figure 3.3).

(c) Discuss the effect of this hypothetical prior information on the conclusions in the
applied context.

12. Poisson regression model: expand the model of Exercise 2.13(a) by assuming that the
number of fatal accidents in year $t$ follows a Poisson distribution with mean $\alpha + \beta t$. You
will estimate $\alpha$ and $\beta$, following the example of the analysis in Section 3.7.

(a) Discuss various choices for a ‘noninformative’ prior for $(\alpha, \beta)$. Choose one.

(b) Discuss what would be a realistic informative prior distribution for $(\alpha, \beta)$. Sketch its
contours and then put it aside. Do parts (c)-(h) of this problem using your noninformative
prior distribution from (a).

(c) Write the posterior density for $(\alpha, \beta)$. What are the sufficient statistics?

(d) Check that the posterior density is proper.

(e) Calculate crude estimates and uncertainties for $(\alpha, \beta)$ using linear regression.

(f) Plot the contours and take 1000 draws from the joint posterior density of $(\alpha, \beta)$.

(g) Using your samples of $(\alpha, \beta)$, plot a histogram of the posterior density for the expected
number of fatal accidents in 1986, $\alpha + 1986\beta$.

(h) Create simulation draws and obtain a 95% predictive interval for the number of fatal
accidents in 1986.

(i) How does your hypothetical informative prior distribution in (b) differ from the posterior
distribution in (f) and (g), obtained from the noninformative prior distribution
and the data? If they disagree, discuss.


13. Multivariate normal model: derive equations (3.12) by completing the square in vector- 
matrix notation. 
14. Improper prior and proper posterior distributions: prove that the posterior density (3.15)
for the bioassay example has a finite integral over the range $(\alpha, \beta) \in (-\infty, \infty) \times (-\infty, \infty)$.
15. Joint distributions: The autoregressive time-series model $y_1, y_2, \ldots$ with mean level 0,
autocorrelation 0.8, residual standard deviation 1, and normal errors can be written as
$(y_t \mid y_{t-1}, y_{t-2}, \ldots) \sim \mathrm{N}(0.8 y_{t-1}, 1)$ for all $t$.
(a) Prove that the distribution of $y_t$, given the observations at all other integer time points
$t$, depends only on $y_{t-1}$ and $y_{t+1}$.
(b) What is the distribution of $y_t$ given $y_{t-1}$ and $y_{t+1}$?


Chapter 4


Asymptotics and connections to non-Bayesian
approaches


We have seen that many simple Bayesian analyses based on noninformative prior distribu- 
tions give similar results to standard non-Bayesian approaches (for example, the posterior t 
interval for the normal mean with unknown variance). The extent to which a noninforma- 
tive prior distribution can be justified as an objective assumption depends on the amount 
of information available in the data: in the simple cases discussed in Chapters 2 and 3, 
it was clear that as the sample size n increases, the influence of the prior distribution on 
posterior inferences decreases. These ideas, sometimes referred to as asymptotic theory, 
because they refer to properties that hold in the limit as n becomes large, will be reviewed 
in the present chapter, along with some more explicit discussion of the connections between 
Bayesian and non-Bayesian methods. The large-sample results are not actually necessary 
for performing Bayesian data analysis but are often useful as approximations and as tools 
for understanding. 

We begin this chapter with a discussion of the various uses of the normal approximation 
to the posterior distribution. Theorems about consistency and normality of the posterior 
distribution in large samples are outlined in Section 4.2, followed by several counterexamples 
in Section 4.3; proofs of the theorems are sketched in Appendix B. Finally, we discuss how 
the methods of frequentist statistics can be used to evaluate the properties of Bayesian 
inferences. 


4.1 Normal approximations to the posterior distribution 
Normal approximation to the joint posterior distribution 


If the posterior distribution $p(\theta \mid y)$ is unimodal and roughly symmetric, it can be convenient
to approximate it by a normal distribution; that is, the logarithm of the posterior density
is approximated by a quadratic function of $\theta$.

Here we consider a quadratic approximation to the log-posterior density that is centered 
at the posterior mode (which in general is easy to compute using off-the-shelf optimization 
routines); in Chapter 13 we discuss more elaborate approximations which can be effective 
in settings where simple mode-based approximations fail. 

A Taylor series expansion of $\log p(\theta \mid y)$ centered at the posterior mode, $\hat\theta$ (where $\theta$ can
be a vector and $\hat\theta$ is assumed to be in the interior of the parameter space), gives


$$\log p(\theta \mid y) = \log p(\hat\theta \mid y) + \frac{1}{2} (\theta - \hat\theta)^T \left[ \frac{d^2}{d\theta^2} \log p(\theta \mid y) \right]_{\theta = \hat\theta} (\theta - \hat\theta) + \cdots, \tag{4.1}$$


where the linear term in the expansion is zero because the log-posterior density has zero
derivative at its mode. As we discuss in Section 4.2, the remainder terms of higher order fade
in importance relative to the quadratic term when $\theta$ is close to $\hat\theta$ and $n$ is large. Considering
(4.1) as a function of $\theta$, the first term is a constant, whereas the second term is proportional
to the logarithm of a normal density, yielding the approximation,


$$p(\theta \mid y) \approx \mathrm{N}(\hat\theta, [I(\hat\theta)]^{-1}), \tag{4.2}$$


where $I(\theta)$ is the observed information,


$$I(\theta) = -\frac{d^2}{d\theta^2} \log p(\theta \mid y).$$


If the mode, $\hat\theta$, is in the interior of parameter space, then the matrix $I(\hat\theta)$ is positive definite.


Example. Normal distribution with unknown mean and variance 

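We illustrate the approximate normal distribution with a simple theoretical example.
Let $y_1, \ldots, y_n$ be independent observations from a $\mathrm{N}(\mu, \sigma^2)$ distribution, and,
for simplicity, we assume a uniform prior density for $(\mu, \log \sigma)$. We set up a normal
approximation to the posterior distribution of $(\mu, \log \sigma)$, which has the virtue of restricting
$\sigma$ to positive values. To construct the approximation, we need the second
derivatives of the log posterior density,


$$\log p(\mu, \log \sigma \mid y) = \text{constant} - n \log \sigma - \frac{1}{2\sigma^2} \left( (n - 1) s^2 + n (\bar{y} - \mu)^2 \right).$$


The first derivatives are


$$\begin{aligned}
\frac{d}{d\mu} \log p(\mu, \log \sigma \mid y) &= \frac{n (\bar{y} - \mu)}{\sigma^2}, \\
\frac{d}{d(\log \sigma)} \log p(\mu, \log \sigma \mid y) &= -n + \frac{(n - 1) s^2 + n (\bar{y} - \mu)^2}{\sigma^2},
\end{aligned}$$


from which the posterior mode is readily obtained as


$$(\hat\mu, \log \hat\sigma) = \left( \bar{y}, \ \log\left( \sqrt{\frac{n - 1}{n}}\, s \right) \right).$$


The second derivatives of the log posterior density are


$$\begin{aligned}
\frac{d^2}{d\mu^2} \log p(\mu, \log \sigma \mid y) &= -\frac{n}{\sigma^2}, \\
\frac{d^2}{d\mu \, d(\log \sigma)} \log p(\mu, \log \sigma \mid y) &= -2n \frac{\bar{y} - \mu}{\sigma^2}, \\
\frac{d^2}{d(\log \sigma)^2} \log p(\mu, \log \sigma \mid y) &= -\frac{2}{\sigma^2} \left( (n - 1) s^2 + n (\bar{y} - \mu)^2 \right).
\end{aligned}$$

The matrix of second derivatives at the mode is then $\begin{pmatrix} -n/\hat\sigma^2 & 0 \\ 0 & -2n \end{pmatrix}$. From (4.2),
the posterior distribution can be approximated as


$$p(\mu, \log \sigma \mid y) \approx \mathrm{N}\left( \begin{pmatrix} \mu \\ \log \sigma \end{pmatrix} \,\middle|\, \begin{pmatrix} \bar{y} \\ \log \hat\sigma \end{pmatrix}, \begin{pmatrix} \hat\sigma^2/n & 0 \\ 0 & 1/(2n) \end{pmatrix} \right).$$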
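The derivation in this example can be checked numerically. The Python sketch below computes the mode and the covariance matrix from the closed-form expressions and compares the curvature with central finite differences of the log posterior density. The data vector is hypothetical, and the function names and step size `h` are illustrative choices.

```python
import math


def summaries(y):
    """Return n, the sample mean, and the sample variance s^2."""
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
    return n, ybar, s2


def log_post(mu, log_sigma, y):
    """Log posterior of (mu, log sigma) up to a constant, uniform prior on (mu, log sigma)."""
    n, ybar, s2 = summaries(y)
    sigma2 = math.exp(2.0 * log_sigma)
    return -n * log_sigma - ((n - 1) * s2 + n * (ybar - mu) ** 2) / (2.0 * sigma2)


def normal_approx(y):
    """Mode and covariance of the normal approximation, from the closed forms."""
    n, ybar, s2 = summaries(y)
    sigma_hat2 = (n - 1) * s2 / n
    mode = (ybar, 0.5 * math.log(sigma_hat2))
    cov = ((sigma_hat2 / n, 0.0), (0.0, 1.0 / (2 * n)))
    return mode, cov


def second_derivative(f, point, i, j, h=1e-4):
    """Central finite-difference estimate of the (i, j) second partial derivative."""
    def shifted(di, dj):
        p = list(point)
        p[i] += di
        p[j] += dj
        return f(*p)
    return (shifted(h, h) - shifted(h, -h) - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)


y = [9.2, 10.1, 11.4, 10.8, 9.7, 10.3]   # hypothetical data
mode, cov = normal_approx(y)
f = lambda mu, log_sigma: log_post(mu, log_sigma, y)
hess = [[second_derivative(f, mode, i, j) for j in range(2)] for i in range(2)]
```

At the mode, the numerical Hessian should agree with the diagonal matrix with entries $-n/\hat\sigma^2$ and $-2n$, and the negative inverse of that matrix is `cov`.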


If we had instead constructed the normal approximation in terms of $p(\mu, \sigma^2)$, the second
derivative matrix would be multiplied by the Jacobian of the transformation from
$\log \sigma$ to $\sigma^2$ and the mode would change slightly, to $\tilde\sigma^2 = \frac{n}{n+2} \hat\sigma^2$. The two components,
$(\mu, \sigma^2)$, would still be independent in their approximate posterior distribution, and


$$p(\sigma^2 \mid y) \approx \mathrm{N}(\sigma^2 \mid \tilde\sigma^2, 2\tilde\sigma^4/(n + 2)).$$


Interpretation of the posterior density function relative to its maximum


In addition to its direct use as an approximation, the multivariate normal distribution provides
a benchmark for interpreting the posterior density function and contour plots. In the
$d$-dimensional normal distribution, the logarithm of the density function is a constant plus a $\chi^2_d$
distribution divided by $-2$. For example, the 95th percentile of the $\chi^2_{10}$ density is 18.31, so if
a problem has $d = 10$ parameters, then approximately 95% of the posterior probability mass
is associated with the values of $\theta$ for which $p(\theta \mid y)$ is no less than $\exp(-18.31/2) = 1.1 \times 10^{-4}$
times the density at the mode. Similarly, with $d = 2$ parameters, approximately 95% of the
posterior mass corresponds to densities above $\exp(-5.99/2) = 0.05$, relative to the density
at the mode. In a two-dimensional contour plot of a posterior density (for example, Figure
3.3a), the 0.05 contour line thus includes approximately 95% of the probability mass.
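The benchmark can be checked by simulation. The Python sketch below draws from a standard $d$-dimensional normal distribution and counts how often the density is at least a given fraction of its value at the mode. The function name, the number of draws, and the seed are illustrative choices; the thresholds are the ones quoted in the text.

```python
import math
import random


def mass_above_relative_density(d, threshold, n_draws=20000, seed=0):
    """Fraction of N(0, I_d) draws with density >= threshold * (density at the mode).

    For the standard normal, the density relative to the mode is exp(-q / 2),
    where q is the squared distance from the mode, a chi-squared(d) variable.
    """
    rng = random.Random(seed)
    count = 0
    for _ in range(n_draws):
        q = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        if math.exp(-q / 2.0) >= threshold:
            count += 1
    return count / n_draws


frac_2 = mass_above_relative_density(2, 0.05)
frac_10 = mass_above_relative_density(10, math.exp(-18.31 / 2.0))
```

Both fractions should come out near 0.95, up to Monte Carlo error. For $d = 2$ the result is exact in theory, since the $\chi^2_2$ distribution has cdf $1 - \exp(-q/2)$.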


Summarizing posterior distributions by point estimates and standard errors 


The asymptotic theory outlined in Section 4.2 shows that if $n$ is large enough, a posterior
distribution can be approximated by a normal distribution. In many areas of application, a
standard inferential summary is the 95% interval obtained by computing a point estimate,
$\hat\theta$, such as the maximum likelihood estimate (which is the posterior mode under a uniform
prior density), plus or minus two standard errors, with the standard error estimated from
the information at the estimate, $I(\hat\theta)$. A different asymptotic argument justifies the
non-Bayesian, frequentist interpretation of this summary, but in many simple situations both
interpretations hold. It is difficult to give general guidelines on when the normal approximation
is likely to be adequate in practice. From the Bayesian point of view, the accuracy
in any given example can be directly determined by inspecting the posterior distribution.

In many cases, convergence to normality of the posterior distribution for a parameter
$\theta$ can be dramatically improved by transformation. If $\phi$ is a continuous transformation
of $\theta$, then both $p(\phi \mid y)$ and $p(\theta \mid y)$ approach normal distributions, but the closeness of the
approximation for finite $n$ can vary substantially with the transformation chosen.


Data reduction and summary statistics 


Under the normal approximation, the posterior distribution is summarized by its mode, θ̂,
and the curvature of the posterior density, I(θ̂); that is, asymptotically, these are sufficient
statistics. In the examples at the end of the next chapter, we shall see that it can be 
convenient to summarize ‘local-level’ or ‘individual-level’ data from a number of sources by 
their normal-theory sufficient statistics. This approach using summary statistics allows the 
relatively easy application of hierarchical modeling techniques to improve each individual 
estimate. For example, in Section 5.5, each of a set of eight experiments is summarized by 
a point estimate and a standard error estimated from an earlier linear regression analysis. 
Using summary statistics is clearly most reasonable when posterior distributions are close 
to normal; the approach can otherwise discard important information and lead to erroneous 
inferences. 


Lower-dimensional normal approximations 


For a finite sample size n, the normal approximation is typically more accurate for
conditional and marginal distributions of components of θ than for the full joint distribution. For
example, if a joint distribution is multivariate normal, all its margins are normal, but the
converse is not true. Determining the marginal distribution of a component of θ is
equivalent to averaging over all the other components of θ, and averaging a family of distributions
generally brings them closer to normality, by the same logic that underlies the central limit
theorem.

Figure 4.1 (a) Contour plot of the normal approximation to the posterior distribution of the
parameters in the bioassay example. Contour lines are at 0.05, 0.15, ..., 0.95 times the density at the
mode. Compare to Figure 3.3a. (b) Scatterplot of 1000 draws from the normal approximation to
the posterior distribution. Compare to Figure 3.3b.


The normal approximation for the posterior distribution of a low-dimensional θ is often
perfectly acceptable, especially after appropriate transformation. If θ is high-dimensional,
two situations commonly arise. First, the marginal distributions of many individual
components of θ can be approximately normal; inference about any one of these parameters,
taken individually, can then be well summarized by a point estimate and a standard error.
Second, it is possible that θ can be partitioned into two subvectors, θ = (θ_1, θ_2), for which
p(θ_2|y) is not necessarily close to normal, but p(θ_1|θ_2, y) is, perhaps with mean and variance
that are functions of θ_2. The approach of approximation using conditional distributions is
often useful, and we consider it more systematically in Section 13.5. Lower-dimensional 
approximations are increasingly popular, for example in computation for latent Gaussian 
models. 


Finally, approximations based on the normal distribution are often useful for debugging 
a computer program or checking a more elaborate method for approximating the posterior 
distribution. 


Example. Bioassay experiment (continued) 

We illustrate the normal approximation for the model and data from the bioassay 
experiment of Section 3.7. The sample size in this experiment is relatively small, only 
twenty animals in all, and we find that the normal approximation is close to the exact 
posterior distribution but with important differences. 


The normal approximation to the joint posterior distribution of (α, β). To begin, we
compute the mode of the posterior distribution (using a logistic regression program)
and the normal approximation (4.2) evaluated at the mode. The posterior mode of
(α, β) is the same as the maximum likelihood estimate because we have assumed a
uniform prior density for (α, β). Figure 4.1 shows a contour plot of the bivariate normal
approximation and a scatterplot of 1000 draws from this approximate distribution.
The plots resemble the plots of the actual posterior distribution in Figure 3.3 but
without the skewness in the upper right corner of the earlier plots. The effect of
the skewness is apparent when comparing the mean of the normal approximation,
(α, β) = (0.8, 7.7), to the mean of the actual posterior distribution, (α, β) = (1.4, 11.9),
computed from the simulations displayed in Figure 3.3b.
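
As a rough illustration of how the mode and curvature for this example could be obtained without a packaged logistic regression routine, the sketch below applies Newton's method to the binomial-logit log likelihood. It assumes the bioassay data as tabulated in Section 3.7 (four dose levels, five animals each); the function names and the fixed iteration count are our own choices, not the authors' program.

```python
import math

# Bioassay data, assumed to match Section 3.7: log dose, animals, deaths.
x = [-0.86, -0.30, -0.05, 0.73]
n = [5, 5, 5, 5]
y = [0, 1, 3, 5]

def inv_logit(t):
    return 1.0 / (1.0 + math.exp(-t))

def score_and_information(alpha, beta):
    """Gradient of the log likelihood and the observed information (2 x 2, symmetric)."""
    g_a = g_b = i_aa = i_ab = i_bb = 0.0
    for xi, ni, yi in zip(x, n, y):
        p = inv_logit(alpha + beta * xi)
        w = ni * p * (1.0 - p)
        g_a += yi - ni * p
        g_b += xi * (yi - ni * p)
        i_aa += w
        i_ab += w * xi
        i_bb += w * xi * xi
    return (g_a, g_b), (i_aa, i_ab, i_bb)

def fit_mode(iterations=50):
    """Newton's method started at (0, 0); each step solves a 2 x 2 linear system."""
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        (g_a, g_b), (i_aa, i_ab, i_bb) = score_and_information(alpha, beta)
        det = i_aa * i_bb - i_ab * i_ab
        alpha += (i_bb * g_a - i_ab * g_b) / det
        beta += (i_aa * g_b - i_ab * g_a) / det
    return alpha, beta

alpha_hat, beta_hat = fit_mode()

# Covariance of the normal approximation: inverse of the information at the mode.
_, (i_aa, i_ab, i_bb) = score_and_information(alpha_hat, beta_hat)
det = i_aa * i_bb - i_ab * i_ab
var_alpha, cov_ab, var_beta = i_bb / det, -i_ab / det, i_aa / det

# Pr(beta > 0) under the normal approximation, from the normal CDF.
z = beta_hat / math.sqrt(var_beta)
pr_beta_positive = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

Under these assumptions the mode agrees with the (0.8, 7.7) quoted above to the stated precision, and the approximate probability that β is positive comes out near 0.95, in line with the simulation-based estimate reported for this example.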
Figure 4.2 (a) Histogram of the simulations of LD50, conditional on β > 0, in the bioassay example
based on the normal approximation p(α, β|y). The wide tails of the histogram correspond to values
of β close to 0. Omitted from this histogram are five simulation draws with values of LD50 less
than −2 and four draws with values greater than 2; the extreme tails are truncated to make the
histogram visible. The values of LD50 for the 950 simulation draws corresponding to β > 0 had a
range of [−12.4, 5.4]. Compare to Figure 3.4. (b) Histogram of the central 95% of the distribution.


The posterior distribution for the LD50 using the normal approximation on (α, β).
Flaws of the normal approximation. The same set of 1000 draws from the normal
approximation can be used to estimate the probability that β is positive and the
posterior distribution of the LD50, conditional on β being positive. Out of the 1000
simulation draws, 950 had positive values of β, yielding the estimate Pr(β > 0) = 0.95,
a different result than from the exact distribution, where Pr(β > 0) > 0.999.
Continuing with the analysis based on the normal approximation, we compute the LD50
as −α/β for each of the 950 draws with β > 0; Figure 4.2a presents a histogram of
the LD50 values, excluding some extreme values in both tails. (If the entire range of
the simulations were included, the shape of the distribution would be nearly
impossible to see.) To get a better picture of the center of the distribution, we display in
Figure 4.2b a histogram of the middle 95% of the 950 simulation draws of the LD50.
The histograms are centered in approximately the same place as Figure 3.4 but with
substantially more variation, due to the possibility that β is close to zero.

In summary, posterior inferences based on the normal approximation here are roughly 
similar to the exact results, but because of the small sample, the actual joint posterior 
distribution is substantially more skewed than the large-sample approximation, and 
the posterior distribution of the LD50 actually has much shorter tails than implied by 
using the joint normal approximation. Whether or not these differences imply that 
the normal approximation is inadequate for practical use in this example depends on 
the ultimate aim of the analysis. 


4.2 Large-sample theory


To understand why the normal approximation is often reasonable, we review some theory 
of how the posterior distribution behaves as the amount of data, from some fixed sampling 
distribution, increases. 


Notation and mathematical setup 


The basic tool of large sample Bayesian inference is asymptotic normality of the posterior 
distribution: as more and more data arrive from the same underlying process, the posterior 
distribution of the parameter vector approaches multivariate normality, even if the true 
distribution of the data is not within the parametric family under consideration.
Mathematically, the results apply most directly to observations y_1, ..., y_n that are independent
outcomes sampled from a common distribution, f(y). In many situations, the notion of a
‘true’ underlying distribution, f(y), for the data is difficult to interpret, but it is necessary
in order to develop the asymptotic theory. Suppose the data are modeled by a parametric
family, p(y|θ), with a prior distribution p(θ). In general, the data points y_i and the
parameter θ can be vectors. If the true data distribution is included in the parametric family
(that is, if f(y) = p(y|θ_0) for some θ_0), then, in addition to asymptotic normality, the property
of consistency holds: the posterior distribution converges to a point mass at the true
parameter value, θ_0, as n → ∞. When the true distribution is not included in the parametric
family, there is no longer a true value θ_0, but its role in the theoretical result is replaced by
a value θ_0 that makes the model distribution, p(y|θ), closest to the true distribution, f(y),
in a technical sense involving Kullback-Leibler divergence, as is explained in Appendix B.

In discussing the large-sample properties of posterior distributions, the concept of Fisher 
information, J(θ), introduced as (2.20) in Section 2.8 in the context of Jeffreys’ prior
distributions, plays an important role.


Asymptotic normality and consistency 


The fundamental mathematical result given in Appendix B shows that, under some
regularity conditions (notably that the likelihood is a continuous function of θ and that θ_0 is
not on the boundary of the parameter space), as n → ∞, the posterior distribution of θ
approaches normality with mean θ_0 and variance (nJ(θ_0))^−1. At its simplest level, this
result can be understood in terms of the Taylor series expansion (4.1) of the log posterior
density centered about the posterior mode. A preliminary result shows that the posterior
mode is consistent for θ_0, so that as n → ∞, the mass of the posterior distribution p(θ|y)
becomes concentrated in smaller and smaller neighborhoods of θ_0, and the distance |θ̂ − θ_0|
approaches zero.

Furthermore, we can rewrite the coefficient of the quadratic term in (4.1):

$$ \left[\frac{d^2}{d\theta^2}\log p(\theta\mid y)\right]_{\theta=\hat\theta} = \left[\frac{d^2}{d\theta^2}\log p(\theta)\right]_{\theta=\hat\theta} + \sum_{i=1}^{n}\left[\frac{d^2}{d\theta^2}\log p(y_i\mid\theta)\right]_{\theta=\hat\theta}. $$

This coefficient is a single term for the prior plus the sum of n likelihood terms, each of
whose expected value under the true sampling distribution of y_i, p(y|θ_0), is approximately
−J(θ_0), as long as θ̂ is close to θ_0 (we are assuming now that f(y) = p(y|θ_0) for some θ_0).
Therefore, for large n, the curvature of the log posterior density can be approximated by
the Fisher information, evaluated at either θ̂ or θ_0 (where only the former is available in
practice).

In summary, in the limit of large n, in the context of a specified family of models, 
the posterior mode, θ̂, approaches θ_0, and the curvature (the observed information or the
negative of the coefficient of the second term in the Taylor expansion) approaches nJ(θ̂) or
nJ(θ_0). In addition, as n → ∞, the likelihood dominates the prior distribution, so we can
just use the likelihood alone to obtain the mode and curvature for the normal approximation. 
More precise statements of the theorems and outlines of proofs appear in Appendix B. 


Likelihood dominating the prior distribution 


The asymptotic results formalize the notion that the importance of the prior distribution 
diminishes as the sample size increases. One consequence of this result is that in problems 
with large sample sizes we need not work especially hard to formulate a prior distribution 
that accurately reflects all available information. When sample sizes are small, the prior 
distribution is a critical part of the model specification. 
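
A small numerical illustration of this point, using a conjugate beta-binomial model rather than anything specific to this section: the sketch compares posterior means of a binomial proportion under two deliberately different priors as n grows. The priors, the assumed success rate of 0.3, and the function names are all hypothetical choices made for the example.

```python
def posterior_mean(successes, n, a, b):
    """Posterior mean of a binomial proportion under a Beta(a, b) prior."""
    return (a + successes) / (a + b + n)

def prior_gap(n, rate=0.3):
    """Absolute difference in posterior means under two quite different priors."""
    successes = round(rate * n)
    flat = posterior_mean(successes, n, 1.0, 1.0)           # uniform prior
    opinionated = posterior_mean(successes, n, 20.0, 5.0)   # prior mean 0.8
    return abs(flat - opinionated)

# The gap shrinks roughly in proportion to 1/n.
gaps = {n: prior_gap(n) for n in (10, 100, 1000, 10000)}
```

With ten observations the two priors give very different answers; with ten thousand the difference is in the third decimal place.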
4.3 Counterexamples to the theorems


A good way to understand the limitations of the large-sample results is to consider cases in 
which the theorems fail. The normal distribution is usually helpful as a starting approxi- 
mation, but one must examine deviations, especially with unusual parameter spaces and in 
the extremes of the distribution. The counterexamples to the asymptotic theorems gener- 
ally correspond to situations in which the prior distribution has an impact on the posterior 
inference, even in the limit of infinite sample sizes. 


Underidentified models and nonidentified parameters. The model is underidentified given
data y if the likelihood, p(y|θ), is equal for a range of values of θ. This may also be called
a flat likelihood (although that term is sometimes also used for likelihoods for parameters
that are only weakly identified by the data, so that the likelihood function is not strictly equal
for a range of values, only almost so). Under such a model, there is no single point θ_0 to
which the posterior distribution can converge.

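For example, consider the model,

$$ \begin{pmatrix} u \\ v \end{pmatrix} \sim \mathrm{N}\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right), $$

in which only one of u or v is observed from each pair (u, v). Here, the parameter ρ is
nonidentified. The data supply no information about ρ, so the posterior distribution of ρ is
the same as its prior distribution, no matter how large the dataset is.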

The only solution to a problem of nonidentified or underidentified parameters is to
recognize that the problem exists and, if there is a desire to estimate these parameters
more precisely, gather further information that can enable the parameters to be estimated
(either from future data collection or from external information that can inform a prior
distribution).

Number of parameters increasing with sample size. In complicated problems, there can
be large numbers of parameters, and then we need to distinguish between different types
of asymptotics. If, as n increases, the model changes so that the number of parameters
increases as well, then the simple results outlined in Sections 4.1 and 4.2, which assume a
fixed model class p(y_i|θ), do not apply. For example, sometimes a parameter is assigned
for each sampling unit in a study, as in y_i ~ N(θ_i, σ^2). The parameters θ_i generally
cannot be estimated consistently unless the amount of data collected from each sampling
unit increases along with the number of units. In nonparametric models such as Gaussian
processes (see Chapter 21) there can be a new latent parameter corresponding to each data
point.

As with underidentified parameters, the posterior distribution for θ_i will not converge to
a point mass if new data do not bring enough information about θ_i. Here, the posterior
distribution will not in general converge to a point in the expanding parameter space (reflecting
the increasing dimensionality of θ), and its projection into any fixed space (for example,
the marginal posterior distribution of any particular θ_i) will not necessarily converge to a
point either.

Aliasing. Aliasing is a special case of underidentified parameters in which the same
likelihood function repeats at a discrete set of points. For example, consider the following normal
mixture model with independent and identically distributed data y_1, ..., y_n and parameter
vector θ = (μ_1, μ_2, σ_1^2, σ_2^2, λ):

$$ p(y_i \mid \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \lambda) = \lambda \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{1}{2\sigma_1^2}(y_i-\mu_1)^2} + (1-\lambda) \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-\frac{1}{2\sigma_2^2}(y_i-\mu_2)^2}. $$

If we interchange each of (μ_1, μ_2) and (σ_1^2, σ_2^2), and replace λ by (1 − λ), the likelihood of
the data remains the same. The posterior distribution of this model generally has at least
two modes and consists of a (50%, 50%) mixture of two distributions that are mirror images
of each other; it does not converge to a single point no matter how large the dataset is.

In general, the problem of aliasing is eliminated by restricting the parameter space so
that no duplication appears; in the above example, the aliasing can be removed by restricting
μ_1 to be less than or equal to μ_2.
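
The aliasing in the two-component mixture can be checked numerically. The sketch below evaluates the log likelihood at one parameter setting and at its relabeled twin; the data values and parameter settings are hypothetical, chosen only for illustration.

```python
import math

def normal_pdf(y, mu, var):
    """Normal density with mean mu and variance var."""
    return math.exp(-(y - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_loglik(data, mu1, mu2, var1, var2, lam):
    """Log likelihood of the two-component normal mixture with weight lam on component 1."""
    return sum(
        math.log(lam * normal_pdf(y, mu1, var1) + (1.0 - lam) * normal_pdf(y, mu2, var2))
        for y in data
    )

data = [-1.2, -0.4, 0.3, 2.1, 2.8, 3.5]   # hypothetical observations

# Swap the component labels and replace lam by 1 - lam.
original = mixture_loglik(data, 0.0, 3.0, 1.0, 0.5, 0.3)
relabeled = mixture_loglik(data, 3.0, 0.0, 0.5, 1.0, 0.7)
```

The two values agree up to floating-point rounding, which is why a posterior under an exchangeable prior has mirror-image modes unless a constraint such as an ordering of the means is imposed.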


Unbounded likelihoods. If the likelihood function is unbounded, then there might be no
posterior mode within the parameter space, invalidating both the consistency results and
the normal approximation. For example, consider the previous normal mixture model; for
simplicity, assume that λ is known (and not equal to 0 or 1). If we set μ_1 = y_i for any
arbitrary y_i, and let σ_1^2 → 0, then the likelihood approaches infinity. As n → ∞, the
number of modes of the likelihood increases. If the prior distribution is uniform on σ_1^2 and
σ_2^2 in the region near zero, there will be likewise an increasing number of posterior modes,
with no corresponding normal approximations. A prior distribution proportional to σ_1^−2 σ_2^−2
just makes things worse because this puts more probability near zero, causing the posterior
distribution to explode even faster at zero.
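
The pole in the mixture likelihood is easy to see numerically. In the sketch below, one component mean is pinned to the first data point and its variance is shrunk toward zero while everything else is held fixed. The data, the value λ = 0.5, and the second component's settings are hypothetical.

```python
import math

def normal_pdf(y, mu, var):
    """Normal density with mean mu and variance var."""
    return math.exp(-(y - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_loglik(data, mu1, mu2, var1, var2, lam):
    """Log likelihood of the two-component normal mixture with weight lam on component 1."""
    return sum(
        math.log(lam * normal_pdf(y, mu1, var1) + (1.0 - lam) * normal_pdf(y, mu2, var2))
        for y in data
    )

data = [-1.2, -0.4, 0.3, 2.1, 2.8, 3.5]   # hypothetical observations
lam = 0.5                                  # treated as known

# Pin mu1 at the first observation and let var1 shrink toward zero.
variances = (1.0, 1e-2, 1e-4, 1e-6, 1e-8)
logliks = [mixture_loglik(data, data[0], 1.0, v, 1.0, lam) for v in variances]
```

Each hundredfold reduction in the variance eventually adds about log(10) ≈ 2.3 to the log likelihood, with no upper limit, because the pinned data point contributes a term of order 1/σ_1 while the second component keeps every other term bounded away from zero.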

In general, this problem should arise rarely in practice, because the poles of an un- 
bounded likelihood correspond to unrealistic conditions in a model. The problem can be 
solved by restricting to a plausible set of distributions. When the problem occurs for 
variance components near zero, it can be resolved in various ways, such as using a prior 
distribution that declines to zero at the boundary or by assigning an informative prior 
distribution to the ratio of the variance parameters. 


Improper posterior distributions. If the unnormalized posterior density, obtained by multi- 
plying the likelihood by a ‘formal’ prior density representing an improper prior distribution, 
integrates to infinity, then the asymptotic results, which rely on probabilities summing to 
1, do not follow. An improper posterior distribution cannot occur except with an improper 
prior distribution. 

A simple example arises from combining a Beta(0,0) prior distribution for a binomial 
proportion with data consisting of n successes and 0 failures. More subtle examples, with 
hierarchical binomial and normal models, are discussed in Sections 5.3 and 5.4. 
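
To see why the first example fails, a short calculation suffices (a sketch, writing θ for the binomial proportion and assuming n ≥ 1). The Beta(0,0) prior density is proportional to θ^−1 (1 − θ)^−1, and n successes with no failures give a likelihood proportional to θ^n, so

```latex
p(\theta \mid y) \;\propto\; \theta^{n}\cdot\theta^{-1}(1-\theta)^{-1}
  \;=\; \theta^{\,n-1}(1-\theta)^{-1},
\qquad
\int_0^1 \theta^{\,n-1}(1-\theta)^{-1}\,d\theta \;=\; \infty .
```

The integral diverges because the integrand behaves like (1 − θ)^−1 near θ = 1, so the unnormalized posterior density cannot be scaled to integrate to 1.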

The solution to this problem is clear. An improper prior distribution is only a convenient 
approximation, and if it does not give rise to a proper posterior distribution then the sought 
convenience is lost. In this case a proper prior distribution is needed, or at least an improper 
prior density that when combined with the likelihood has a finite integral. 


Prior distributions that exclude the point of convergence. If p(θ_0) = 0 for a discrete
parameter space, or if p(θ) = 0 in a neighborhood about θ_0 for a continuous parameter space, then
the convergence results, which are based on the likelihood dominating the prior distribution,
do not hold. The solution is to give positive probability density in the prior distribution to
all values of θ that are even remotely plausible.


Convergence to the edge of parameter space. If θ_0 is on the boundary of the parameter
space, then the Taylor series expansion must be truncated in some directions, and the
normal distribution will not necessarily be appropriate, even in the limit.

For example, consider the model, y_i ~ N(θ, 1), with the restriction θ ≥ 0. Suppose that
the model is accurate, with θ = 0 as the true value. The posterior distribution for θ is
normal, centered at ȳ, truncated to be positive. The shape of the posterior distribution for
θ, in the limit as n → ∞, is half of a normal distribution, centered about 0, truncated to
be positive.

For another example, consider the same assumed model, but now suppose that the true
θ is −1, a value outside the assumed parameter space. The limiting posterior distribution
for θ has a sharp spike at 0 with no resemblance to a normal distribution at all. The
solution in practice is to recognize the difficulties of applying the normal approximation if
one is interested in parameter values near the edge of parameter space. More important,
one should give positive prior probability density to all values of θ that are even remotely
possible, or in the neighborhood of remotely possible values.


Tails of the distribution. The normal approximation can hold for essentially all the mass 
of the posterior distribution but still not be accurate in the tails. For example, suppose 
p(θ|y) is proportional to e^(−c|θ|) as |θ| → ∞, for some constant c; by comparison, the normal
density is proportional to e^(−cθ^2). The distribution function still converges to normality, but
for any finite sample size n the approximation fails far out in the tail. As another example, 
consider any parameter that is constrained to be positive. For any finite sample size, the 
normal approximation will admit the possibility of the parameter being negative, because 
the approximation is simply not appropriate at that point in the tail of the distribution, 
but that point becomes farther and farther in the tail as n increases. 


4.4 Frequency evaluations of Bayesian inferences 


Just as the Bayesian paradigm can be seen to justify simple ‘classical’ techniques, the 
methods of frequentist statistics provide a useful approach for evaluating the properties of 
Bayesian inferences—their operating characteristics—when these are regarded as embedded 
in a sequence of repeated samples. We have already used this notion in discussing the ideas 
of consistency and asymptotic normality. The notion of stable estimation, which says that 
for a fixed model, the posterior distribution approaches a point as more data arrive— 
leading, in the limit, to inferential certainty—is based on the idea of repeated sampling. 
It is certainly appealing that if the hypothesized family of probability models contains the 
true distribution (and assigns it a nonzero prior density), then as more information about 
θ arrives, the posterior distribution converges to the true value of θ.


Large-sample correspondence 


Suppose that the normal approximation (4.2) for the posterior distribution of θ holds; then
we can transform to the standard multivariate normal:

$$ [I(\hat\theta)]^{1/2} (\theta - \hat\theta) \mid y \sim \mathrm{N}(0, I), \tag{4.3} $$

where θ̂ is the posterior mode and [I(θ̂)]^(1/2) is any matrix square root of I(θ̂). In addition,
θ̂ → θ_0, and so we could just as well write the approximation in terms of I(θ_0). If the true
data distribution is included in the class of models, so that f(y) = p(y|θ) for some θ, then
in repeated sampling with fixed θ, in the limit n → ∞, it can be proved that

$$ [I(\hat\theta)]^{1/2} (\theta - \hat\theta) \mid \theta \sim \mathrm{N}(0, I), \tag{4.4} $$

a result from classical statistical theory that is generally proved for θ̂ equal to the maximum
likelihood estimate but is easily extended to the case with θ̂ equal to the posterior mode.
These results mean that, for any function of (θ − θ̂), the posterior distribution derived from
(4.3) is asymptotically the same as the repeated sampling distribution derived from (4.4).
Thus, for example, a 95% central posterior interval for θ will cover the true value 95% of
the time under repeated sampling with any fixed true θ.
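
The coverage claim can be checked by simulation in a simple case. The sketch below assumes a binomial model with a uniform prior, so that the posterior mode is y/n and the normal approximation gives the interval θ̂ ± 1.96 [θ̂(1 − θ̂)/n]^(1/2). The true value 0.3, the sample size, the number of replications, and the seed are arbitrary choices for illustration.

```python
import math
import random

def coverage(theta=0.3, n=500, replications=2000, seed=1):
    """Fraction of approximate 95% posterior intervals that contain the true theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replications):
        # Draw binomial data as a sum of n Bernoulli(theta) trials.
        y = sum(rng.random() < theta for _ in range(n))
        theta_hat = y / n                                   # posterior mode, uniform prior
        se = math.sqrt(theta_hat * (1.0 - theta_hat) / n)   # from the observed information
        if theta_hat - 1.96 * se <= theta <= theta_hat + 1.96 * se:
            hits += 1
    return hits / replications

estimated_coverage = coverage()
```

With n this large the estimated coverage should sit close to the nominal 95%; repeating the experiment with small n or with θ near 0 or 1 shows how the agreement degrades when the normal approximation is poor.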


Point estimation, consistency, and efficiency 


In the Bayesian framework, obtaining an ‘estimate’ of θ makes most sense in large samples
when the posterior mode, θ̂, is the obvious center of the posterior distribution of θ and the
uncertainty conveyed by nI(θ̂) is so small as to be practically unimportant. More generally,
however, in smaller samples, it is inappropriate to summarize inference about θ by one
value, especially when the posterior distribution of θ is more variable or even asymmetric.
Formally, by incorporating loss functions in a decision-theoretic context (see Section 9.1 
and Exercise 9.6), one can define optimal point estimates; for the purposes of Bayesian data 
analysis, however, we believe that representation of the full posterior distribution (as, for 
example, with 50% and 95% central posterior intervals) is more useful. In many problems, 
especially with large samples, a point estimate and its estimated standard error are adequate 
to summarize a posterior inference, but we interpret the estimate as an inferential summary, 
not as the solution to a decision problem. In any case, the large-sample frequency properties 
of any estimate can be evaluated, without consideration of whether the estimate was derived 
from a Bayesian analysis. 

A point estimate is said to be consistent in the sampling theory sense if, as samples 
get larger, it converges to the true value of the parameter that it is asserted to estimate. 
Thus, if f(y) = p(y|θ_0), then a point estimate θ̂ of θ is consistent if its sampling distribution
converges to a point mass at θ_0 as the data sample size n increases (that is, considering
θ̂ as a function of y, which is a random variable conditional on θ_0). A closely related
concept is asymptotic unbiasedness, where (E(θ̂|θ_0) − θ_0)/sd(θ̂|θ_0) converges to 0 (once
again, considering θ̂(y) as a random variable whose distribution is determined by p(y|θ_0)).
When the truth is included in the family of models being fitted, the posterior mode θ̂, and
also the posterior mean and median, are consistent and asymptotically unbiased under mild 
regularity conditions. 

A point estimate θ̂ is said to be efficient if there exists no other function of y that
estimates θ with lower mean squared error, that is, if the expression E((θ̂ − θ_0)^2 | θ_0) is at
its optimal, lowest value. More generally, the efficiency of θ̂ is the optimal mean squared
error divided by the mean squared error of θ̂. An estimate is asymptotically efficient if its
efficiency approaches 1 as the sample size n → ∞. Under mild regularity conditions, the
center of the posterior distribution (defined, for example, by the posterior mean, median, 
or mode) is asymptotically efficient. 


Confidence coverage 


If a region C(y) includes θ_0 at least 100(1 − α)% of the time (given any value of θ_0) in
repeated samples, then C(y) is called a 100(1 − α)% confidence region for the parameter
θ. The word ‘confidence’ is carefully chosen to distinguish such intervals from probability
intervals and to convey the following behavioral meaning: if one chooses α to be small
enough (for example, 0.05 or 0.01), then since confidence regions cover the truth in at least
(1 − α) of their applications, one should be confident in each application that the truth
is within the region and therefore act as if it is. We saw previously that asymptotically a
100(1 − α)% central posterior interval for θ has the property that, in repeated samples of
y, 100(1 − α)% of the intervals include the value θ_0.


4.5 Bayesian interpretations of other statistical methods 


We consider three levels at which Bayesian statistical methods can be compared with other 
methods. First, as we have already indicated, Bayesian methods are often similar to other 
statistical approaches in problems involving large samples from a fixed probability model. 
Second, even for small samples, many statistical methods can be considered as approxi- 
mations to Bayesian inferences based on particular prior distributions; as a way of under- 
standing a statistical procedure, it is often useful to determine the implicit underlying prior 
distribution. Third, some methods from classical statistics (notably hypothesis testing) 
can give results that differ greatly from those given by Bayesian methods. In this section,
we briefly consider several statistical concepts (point and interval estimation, likelihood
inference, unbiased estimation, frequency coverage of confidence intervals, hypothesis
testing, multiple comparisons, nonparametric methods, and the jackknife and bootstrap) and
discuss their relation to Bayesian methods.

One way to develop possible models is to examine the interpretation of crude
data-analytic procedures as approximations to Bayesian inference under specific models. For
example, a widely used technique in sample surveys is ratio estimation, in which, given
data from a simple random sample, one estimates R = ȳ/x̄ by ȳ_obs/x̄_obs, in the
notation of Chapter 8. It can be shown that this estimate corresponds to a summary of
a Bayesian posterior inference given independent observations y_i|x_i ~ N(R x_i, σ^2 x_i) and a
noninformative prior distribution. Ratio estimates can be useful in a wide variety of cases 
in which this model does not hold, but when the data deviate greatly from this model, the 
ratio estimate generally is not appropriate. 

For another example, standard methods of selecting regression predictors, based on 
‘statistical significance,’ correspond roughly to Bayesian analyses under exchangeable prior 
distributions on the coefficients in which the prior distribution of each coefficient is a mixture 
of a peak at zero and a widely spread distribution, as we discuss further in Section 14.6. We 
believe that understanding this correspondence suggests when such models can be usefully 
applied and how they can be improved. Often, in fact, such procedures can be improved 
by including additional information, for example, in problems involving large numbers of 
predictors, by clustering regression coefficients that are likely to be similar into batches. 


Maximum likelihood and other point estimates 


From the perspective of Bayesian data analysis, we can often interpret classical point esti- 
mates as exact or approximate posterior summaries based on some implicit full probability 
model. In the limit of large sample size, in fact, we can use asymptotic theory to con- 
struct a theoretical Bayesian justification for classical maximum likelihood inference. In the 
limit (assuming regularity conditions), the maximum likelihood estimate, θ̂, is a sufficient
statistic, and so is the posterior mode, mean, or median. That is, for large enough n,
the maximum likelihood estimate (or any of the other summaries) supplies essentially all
the information about θ available from the data. The asymptotic irrelevance of the prior
distribution can be taken to justify the use of convenient noninformative prior models.

In repeated sampling with θ = θ_0,

$$ p(\hat\theta(y) \mid \theta = \theta_0) \approx \mathrm{N}(\hat\theta(y) \mid \theta_0, (nJ(\theta_0))^{-1}); $$

that is, the sampling distribution of θ̂(y) is approximately normal with mean θ_0 and precision
nJ(θ_0), where for clarity we emphasize that θ̂ is a function of y. Assuming that the prior
distribution is locally uniform (or continuous and nonzero) near the true θ, the simple
analysis of the normal mean (Section 3.5) shows that the posterior Bayesian inference is

$$ p(\theta \mid \hat\theta) \approx \mathrm{N}(\theta \mid \hat\theta, (nJ(\hat\theta))^{-1}). $$


This result appears directly from the asymptotic normality theorem, but deriving it
indirectly through Bayesian inference given θ̂ gives insight into a Bayesian rationale for classical
asymptotic inference based on point estimates and standard errors.

For finite n, the above approach is inefficient or wasteful of information to the extent
that θ̂ is not a sufficient statistic. When the number of parameters is large, the consistency
result is often not helpful, and noninformative prior distributions are hard to justify. As
discussed in Chapter 5, hierarchical models are preferable when dealing with a large number
of parameters since then their common distribution can be estimated from data. In
addition, any method of inference based on the likelihood alone can be improved if real prior
information is available that is strong enough to contribute substantially to that contained
in the likelihood function.


Unbiased estimates 


Some non-Bayesian statistical methods place great emphasis on unbiasedness as a desirable 
principle of estimation, and it is intuitively appealing that, over repeated sampling, the mean 
(or perhaps the median) of a parameter estimate should be equal to its true value. Formally, 
an estimate θ̂(y) is called unbiased if E(θ̂(y)|θ) = θ for any value of θ, where this expectation
is taken over the data distribution, p(y|θ). From a Bayesian perspective, the principle of
unbiasedness is reasonable in the limit of large samples (see page 92) but otherwise is 
potentially misleading. The major difficulties arise when there are many parameters to be 
estimated and our knowledge or partial knowledge of some of these parameters is clearly 
relevant to the estimation of others. Requiring unbiased estimates will often lead to relevant 
information being ignored (as we discuss with hierarchical models in Chapter 5). In sampling 
theory terms, minimizing bias will often lead to counterproductive increases in variance. 

One general problem with unbiasedness (and point estimation in general) is that it is
often not possible to estimate several parameters at once in an even approximately unbiased
manner. For example, unbiased estimates of θ_1, ..., θ_J yield an upwardly biased estimate
of the variance of the θ_j’s (except in the trivial case in which the θ_j’s are known exactly).
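
A small simulation can make this upward bias concrete. The sketch below rests on assumptions that are not in the text: J = 50 fixed true values and independent unbiased estimates y_j ~ N(θ_j, 1). Under those assumptions the sample variance of the estimates has expectation equal to the sample variance of the true values plus the sampling variance of 1.

```python
import random
import statistics

random.seed(1)

J = 50
# Hypothetical true parameters, held fixed across replications.
theta = [random.gauss(0, 2) for _ in range(J)]
true_var = statistics.variance(theta)

# Repeatedly draw unbiased estimates y_j ~ N(theta_j, 1) and record
# the sample variance of the estimates in each replication.
n_reps = 2000
var_of_estimates = []
for _ in range(n_reps):
    y = [random.gauss(t, 1) for t in theta]
    var_of_estimates.append(statistics.variance(y))

avg_var = statistics.mean(var_of_estimates)
print(f"variance of the true theta_j's:    {true_var:.2f}")
print(f"average variance of the estimates: {avg_var:.2f}")
```

Each y_j is unbiased for its own θ_j, yet the spread of the y_j's overstates the spread of the θ_j's by about one unit of sampling variance.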

Another problem with the principle of unbiasedness arises when treating a future ob- 
servable value as a parameter in prediction problems. 


Example. Prediction using regression 

Consider the problem of estimating θ, the height of an adult daughter, given y, her
mother’s height. For simplicity, assume that the heights of mothers and daughters
are jointly normally distributed, with known equal means of 160 centimeters, equal
variances, and a known correlation of 0.5. Conditioning on the known value of y (in
other words, using Bayesian inference), the posterior mean of θ is


$$\mathrm{E}(\theta \mid y) = 160 + 0.5(y - 160). \tag{4.5}$$


The posterior mean is not, however, an unbiased estimate of θ, in the sense of repeated
sampling of y given a fixed θ. Given the daughter’s height, θ, the mother’s height, y,
has mean E(y|θ) = 160 + 0.5(θ − 160). Thus, under repeated sampling of y given fixed
θ, the posterior mean (4.5) has expectation 160 + 0.25(θ − 160) and is biased towards
the grand mean of 160. In contrast, the estimate


$$\hat{\theta} = 160 + 2(y - 160)$$


is unbiased under repeated sampling of y, conditional on θ. Unfortunately, the estimate
θ̂ makes no sense for values of y not equal to 160; for example, if a mother is 10
centimeters taller than average, it estimates her daughter to be 20 centimeters taller
than average!
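
The two expectations can be checked by simulation. The sketch below fixes a daughter’s height and repeatedly draws the mother’s height y from its conditional distribution given that value. The common standard deviation of 6 centimeters and the fixed height of 170 centimeters are assumptions chosen for illustration, since the example states only that the variances are equal.

```python
import math
import random
import statistics

random.seed(2)

MEAN, RHO = 160.0, 0.5   # values given in the example
SD = 6.0                 # assumed common standard deviation (not in the text)
theta = 170.0            # assumed fixed daughter's height

def draw_mother_height(theta):
    """Draw y from p(y | theta) under the bivariate normal model."""
    cond_mean = MEAN + RHO * (theta - MEAN)
    cond_sd = SD * math.sqrt(1 - RHO ** 2)
    return random.gauss(cond_mean, cond_sd)

ys = [draw_mother_height(theta) for _ in range(100_000)]

posterior_means = [MEAN + 0.5 * (y - MEAN) for y in ys]   # equation (4.5)
unbiased = [MEAN + 2.0 * (y - MEAN) for y in ys]          # the unbiased estimate

avg_posterior_mean = statistics.fmean(posterior_means)
avg_unbiased = statistics.fmean(unbiased)
print(f"average posterior mean:    {avg_posterior_mean:.2f}")  # theory: 160 + 0.25 * (theta - 160)
print(f"average unbiased estimate: {avg_unbiased:.2f}")        # theory: theta
```

The unbiased estimate also has four times the sampling standard deviation of the posterior mean, because it multiplies y − 160 by 2 rather than by 0.5.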

In this simple example, in which θ has an accepted population distribution, a sensible
non-Bayesian statistician would not use the unbiased estimate θ̂: instead, this problem
would be classified as ‘prediction’ rather than ‘estimation,’ and procedures would not
be evaluated conditional on the random variable θ. The example illustrates, however,
the limitations of unbiasedness as a general principle: it requires unknown quantities to
be characterized either as ‘parameters’ or ‘predictions,’ with different implications for
estimation but no clear substantive distinction. Chapter 5 considers similar situations
in which the population distribution of θ must be estimated from data rather than
conditioning on a particular value.

The important principle illustrated by the example is that of regression to the mean:
for any given mother, the expected value of her daughter’s height lies between her
mother’s height and the population mean. This principle was fundamental to the
original use of the term ‘regression’ for this type of analysis by Galton in the late
19th century. In many ways, Bayesian analysis can be seen as a logical extension of
the principle of regression to the mean, ensuring that proper weighting is made of
information from different sources.


Confidence intervals 


Even in small samples, Bayesian (1 − α) posterior intervals often have close to (1 − α)
confidence coverage under repeated samples conditional on θ. But there are some confidence
intervals, derived purely from sampling-theory arguments, that differ considerably from 
Bayesian probability intervals. From our perspective these intervals are of doubtful value. 
For example, many authors have shown that a general theory based on unconditional
behavior can lead to clearly counterintuitive results, such as confidence
intervals with zero or infinite length. A simple example is the confidence interval that is
empty 5% of the time and contains all of the real line 95% of the time: this always contains 
the true value (of any real-valued parameter) in 95% of repeated samples. Such examples 
do not imply that there is no value in the concept of confidence coverage but rather show 
that coverage alone is not a sufficient basis on which to form reasonable inferences. 


Hypothesis testing 


The perspective of this book has little role for the non-Bayesian concept of hypothesis
tests, especially where these relate to point null hypotheses of the form θ = θ_0. In order
for a Bayesian analysis to yield a nonzero probability for a point null hypothesis, it must
begin with a nonzero prior probability for that hypothesis; in the case of a continuous
parameter, such a prior distribution (comprising a discrete mass, of say 0.5, at θ_0 mixed
with a continuous density elsewhere) usually seems contrived. In fact, most of the difficulties
in interpreting hypothesis tests arise from the artificial dichotomy that is required between
θ = θ_0 and θ ≠ θ_0. Difficulties related to this dichotomy are widely acknowledged from all
perspectives on statistical inference. In problems involving a continuous parameter θ (say
the difference between two means), the hypothesis that θ is exactly zero is rarely reasonable,
and it is of more interest to estimate a posterior distribution or a corresponding interval
estimate of θ. For a continuous parameter θ, the question ‘Does θ equal 0?’ can generally
be rephrased more usefully as ‘What is the posterior distribution for θ?’


In various simple one-sided hypothesis tests, conventional p-values may correspond with
posterior probabilities under noninformative prior distributions. For example, suppose we
observe y = 1 from the model y ~ N(θ, 1), with a uniform prior density on θ. One cannot
‘reject the hypothesis’ that θ = 0: the one-sided p-value is 0.16 and the two-sided p-value
is 0.32, both greater than the conventionally accepted cutoff value of 0.05 for ‘statistical
significance.’ On the other hand, the posterior probability that θ > 0 is 84%, which is a
more satisfactory and informative conclusion than the dichotomous verdict ‘reject’ or ‘do
not reject.’
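
The three numbers quoted in this example can be reproduced in a few lines. This is a minimal sketch using Python’s standard library; the variable names are illustrative, and the posterior distribution N(y, 1) follows from the uniform prior density stated in the example.

```python
from statistics import NormalDist

y = 1.0                      # observed value from the example
std_normal = NormalDist()    # N(0, 1)

# Classical tail areas under the null hypothesis theta = 0.
p_one_sided = 1 - std_normal.cdf(y)
p_two_sided = 2 * p_one_sided

# With a uniform prior density, the posterior is theta | y ~ N(y, 1).
posterior = NormalDist(mu=y, sigma=1.0)
prob_theta_positive = 1 - posterior.cdf(0.0)

print(f"one-sided p-value: {p_one_sided:.2f}")          # 0.16
print(f"two-sided p-value: {p_two_sided:.2f}")          # 0.32
print(f"Pr(theta > 0 | y): {prob_theta_positive:.2f}")  # 0.84
```

The one-sided p-value and the posterior probability sum to 1 here, which is the correspondence between p-values and posterior probabilities mentioned at the start of the paragraph.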


In contrast to the problem of making inference about a parameter within a particular 
model, we do find a form of hypothesis test to be useful when assessing the goodness of fit of 
a probability model. In the Bayesian framework, it is useful to check a model by comparing 
observed data to possible predictive outcomes, as we discuss in detail in Chapter 6.


Multiple comparisons and multilevel modeling


Consider a problem with independent measurements, y_j ~ N(θ_j, 1), on each of J parameters,
in which the goal is to detect differences among and ordering of the continuous parameters
θ_j. Several competing multiple comparisons procedures have been derived in classical
statistics, with rules about when various θ_j’s can be declared ‘significantly different.’ In
the Bayesian approach, the parameters have a joint posterior distribution. One can compute
the posterior probability of each of the J! orderings if desired. If there is posterior
uncertainty in the ordering, several permutations will have substantial probabilities, which
is a more reasonable conclusion than producing a list of θ_j’s that can be declared different
(with the false implication that other θ_j’s may be exactly equal). With J large, the exact
ordering is probably not important, and it might be more reasonable to give a posterior
median and interval estimate of the quantile of each θ_j in the population.
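
The posterior probabilities of the J! orderings are easy to approximate by simulation. The sketch below is illustrative only and makes assumptions not stated in the text: J = 3, hypothetical measurements y = (0.0, 0.5, 2.5), and a uniform prior distribution on each parameter, so that each parameter has an independent N(y_j, 1) posterior distribution.

```python
import random
from collections import Counter

random.seed(3)

y = [0.0, 0.5, 2.5]   # hypothetical measurements, J = 3
n_draws = 20_000

# Under a uniform prior, the posterior is theta_j | y ~ N(y_j, 1), independently.
ordering_counts = Counter()
for _ in range(n_draws):
    theta = [random.gauss(y_j, 1.0) for y_j in y]
    # Record the ordering as 1-based parameter indices, smallest theta first.
    ordering = tuple(sorted(range(1, len(y) + 1), key=lambda j: theta[j - 1]))
    ordering_counts[ordering] += 1

ordering_probs = {o: c / n_draws for o, c in ordering_counts.items()}
for ordering, prob in sorted(ordering_probs.items(), key=lambda kv: -kv[1]):
    print(ordering, f"{prob:.3f}")
```

With these hypothetical data the ordering (1, 2, 3) is the most probable, but other permutations also receive substantial probability, which is the kind of summary described above.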

We prefer to handle multiple comparisons problems using hierarchical models, as we
shall illustrate in a comparison of treatment effects in eight schools in Section 5.5 (see also
Exercise 5.3). Hierarchical modeling automatically partially pools estimates of different θ_j’s
toward each other when there is little evidence for real variation. As a result, this Bayesian
procedure automatically addresses the key concern of classical multiple comparisons analysis,
which is the possibility of finding large differences as a byproduct of searching through
so many possibilities. For example, in the educational testing example, the eight schools
give 8 · 7/2 = 28 possible comparisons, and none turn out to be close to ‘statistically
significant’ (in the sense that zero is contained within the 95% intervals for all the differences
in effects between pairs of schools), which makes sense since the between-school variation
(the parameter τ in that model) is estimated to be low.


Nonparametric methods, permutation tests, jackknife, bootstrap 


Many non-Bayesian methods have been developed that avoid complete probability models, 
even at the sampling level. It is difficult to evaluate many of these from a Bayesian point 
of view. For instance, hypothesis tests for comparing medians based on ranks do not have 
direct counterparts in Bayesian inference; therefore it is hard to interpret the resulting es- 
timates and p-values from a Bayesian point of view (for example, as posterior expectations, 
intervals, or probabilities for parameters or predictions of interest). In complicated prob- 
lems, there is often a degree of arbitrariness in the procedures used; for example there is 
generally no clear method for constructing a nonparametric inference or an estimator to 
jackknife/bootstrap in hypothetical replications. Without a specified probability model, 
it is difficult to see how to test the assumptions underlying a particular nonparametric 
method. In such problems, we find it more satisfactory to construct a joint probability 
distribution and check it against the data (as in Chapter 6) than to construct an estimator 
and evaluate its frequency properties. Nonparametric methods are useful to us as tools for 
data summary and description that can help us to construct models or help us evaluate 
inferences from a completely different perspective. 

From a different direction, one might well say that Bayesian methods involve arbitrary 
choices of models and are difficult to evaluate because in practice there will always be 
important aspects of a model that are impossible to check. Our purpose here is not to 
dismiss or disparage classical nonparametric methods but rather to put them in a Bayesian 
context to the extent this is possible. 

Some nonparametric methods such as permutation tests for experiments and sampling- 
theory inference for surveys turn out to give similar results in simple problems to Bayesian 
inferences with noninformative prior distributions, if the Bayesian model is constructed 
to fit the data reasonably well. Such simple problems include balanced designs with no 
missing data and surveys based on simple random sampling. When estimating several
parameters at once or including explanatory variables in the analysis (using methods such as
the analysis of covariance or regression) or prior information on the parameters, the permu- 
tation/sampling theory methods give no direct answer, and this often provides considerable 
practical incentive to move to a model-based Bayesian approach. 


Example. The Wilcoxon rank test 

Another connection can be made by interpreting nonparametric methods in terms
of implicit models. For example, the Wilcoxon rank test for comparing two samples
(y_1, ..., y_{n_y}) and (z_1, ..., z_{n_z}) proceeds by first ranking each of the points in the
combined data from 1 to n = n_y + n_z, then computing the difference between the average
ranks of the y’s and z’s, and finally computing the p-value of this difference by
comparing to a tabulated reference distribution calculated based on the assumption of
random assignment of the n ranks. This can be formulated as a nonlinear transformation
that replaces each data point by its rank in the combined data, followed by
a comparison of the mean values of the two transformed samples. Even more clear
would be to transform the ranks 1, 2, ..., n to quantiles 1/(2n), 3/(2n), ..., (2n − 1)/(2n), so that the
difference between the two means can be interpreted as an average distance in the scale
of the quantiles of the combined distribution. From the Central Limit Theorem, the
mean difference is approximately normally distributed, and so classical normal-theory
confidence intervals can be interpreted as Bayesian posterior probability statements,
as discussed at the beginning of this section.
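
The transformation view of the rank test can be sketched in code. The example below is a hedged illustration and not the tabulated Wilcoxon procedure: it maps the point with rank r in the combined data to the quantile (2r − 1)/(2n), then forms a normal-theory interval for the difference in mean quantiles. The data are hypothetical, ties are assumed absent, and the standard error uses the usual two-sample formula with separate sample variances, which is an assumed choice.

```python
import math
from statistics import NormalDist, fmean, variance

def rank_quantiles(y, z):
    """Replace each point by the quantile (2r - 1) / (2n), where r is its rank
    in the combined data of size n. Assumes all values are distinct."""
    combined = sorted(y + z)
    n = len(combined)
    quantile = {v: (2 * r - 1) / (2 * n) for r, v in enumerate(combined, start=1)}
    return [quantile[v] for v in y], [quantile[v] for v in z]

def mean_difference_interval(qy, qz, level=0.95):
    """Normal-theory interval for the difference in mean quantiles."""
    diff = fmean(qy) - fmean(qz)
    se = math.sqrt(variance(qy) / len(qy) + variance(qz) / len(qz))
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, (diff - z_crit * se, diff + z_crit * se)

# Hypothetical samples with distinct values.
y = [1.1, 2.3, 3.8, 4.0, 5.6, 6.2]
z = [0.4, 0.9, 1.7, 2.0, 3.1, 4.4]
qy, qz = rank_quantiles(y, z)
diff, (lo, hi) = mean_difference_interval(qy, qz)
print(f"difference in mean quantiles: {diff:.3f}, 95% interval: ({lo:.3f}, {hi:.3f})")
```

With samples this small the normal approximation is rough; the point of the sketch is only that, after the transformation, the comparison reduces to an ordinary difference in means.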

We see two major advantages of expressing rank tests as approximate Bayesian infer- 
ences. First, the Bayesian framework is more flexible than rank testing for handling 
the complications that arise, for example, from additional information such as regres- 
sion predictors or from complications such as censored or truncated data. Second, 
setting up the problem in terms of a nonlinear transformation reveals the general- 
ity of the model-based approach—we are free to use any transformation that might 
be appropriate for the problem, perhaps now treating the combined quantiles as a 
convenient default choice. 


4.6 Bibliographic note 


Relatively little has been written on the practical implications of asymptotic theory for 
Bayesian analysis. The overview by Edwards, Lindman, and Savage (1963) remains one of 
the best and includes a detailed discussion of the principle of ‘stable estimation’ or when 
prior information can be satisfactorily approximated by a uniform density function. Much 
more has been written comparing Bayesian and non-Bayesian approaches to inference, and 
we have largely ignored the extensive philosophical and logical debates on this subject. 
Some good sources on the topic from the Bayesian point of view include Lindley (1958), 
Pratt (1965), and Berger and Wolpert (1984). Jaynes (1976) discusses some disadvantages 
of non-Bayesian methods compared to a particular Bayesian approach. 

In Appendix B we provide references to the asymptotic normality theory. The coun- 
terexamples presented in Section 4.3 have arisen, in various forms, in our own applied 
research. Berzuini et al. (1997) discuss Bayesian inference for sequential data problems, in 
which the posterior distribution changes as data arrive, thus approaching the asymptotic 
results dynamically. 

An example of the use of the normal approximation with small samples is provided by 
Rubin and Schenker (1987), who approximate the posterior distribution of the logit of the 
binomial parameter in a real application and evaluate the frequentist operating character- 
istics of their procedure; see also Agresti and Coull (1998). Clogg et al. (1991) provide 
additional discussion of this approach in a more complicated setting. 

Morris (1983) and Rubin (1984) discuss, from two different standpoints, the concept
of evaluating Bayesian procedures by examining long-run frequency properties (such as
coverage of 95% confidence intervals). An example of frequency evaluation of Bayesian 
procedures in an applied problem is given by Zaslavsky (1993). 

Krantz (1999) discusses the strengths and weaknesses of p-values as used in statistical 
data analysis in practice. Discussions of the role of p-values in Bayesian inference appear 
in Bayarri and Berger (1998, 2000). Earlier work on the Bayesian analysis of hypothesis 
testing and the problems of interpreting conventional p-values is provided by Berger and 
Sellke (1987), which contains a lively discussion and many further references. Gelman 
(2008a) and discussants provide a more recent airing of arguments for and against Bayesian 
statistics. Gelman (2006b) compares Bayesian inference and the more generalized approach 
known as belief functions (Dempster, 1967, 1968) using a simple toy example. 

Greenland and Poole (2013) and Gelman (2013a) present some more recent discussions 
of the relevance of classical p-values in Bayesian inference. 

A simple and pragmatic discussion of the need to consider Bayesian ideas in hypothesis 
testing in a biostatistical context is given by Browner and Newman (1987), and further dis- 
cussion of the role of Bayesian thinking in medical statistics appears in Goodman (1999a, b) 
and Sterne and Smith (2001). Gelman and Tuerlinckx (2000), Efron and Tibshirani (2002), 
and Gelman, Hill, and Yajima (2012) give a Bayesian perspective on multiple comparisons 
in the context of hierarchical modeling. 

Stigler (1983) discusses the similarity between Bayesian inference and regression predic- 
tion that we mention in our critique of unbiasedness in Section 4.5; Stigler (1986) discusses 
Galton’s use of regression. 

Sequential monitoring and analysis of clinical trials in medical research is an important 
area of practical application that has been dominated by frequentist thinking but has re- 
cently seen considerable discussion of the merits of a Bayesian approach; recent reviews 
and examples are provided by Freedman, Spiegelhalter, and Parmar (1994), Parmar et al. 
(2001), and Vail et al. (2001). Thall, Simon, and Estey (1995) consider frequency properties 
of Bayesian analyses of sequential trials. More references on sequential designs appear in 
the bibliographic note at the end of Chapter 8. 

The non-Bayesian principles and methods mentioned in Section 4.5 are covered in many 
books, for example, Lehmann (1983, 1986), Cox and Hinkley (1974), Hastie and Tibshi- 
rani (1990), and Efron and Tibshirani (1993). The connection between ratio estimation 
and modeling alluded to in Section 4.5 is discussed by Brewer (1963), Royall (1970), and, 
from our Bayesian approach, Rubin (1987a, p. 46). Conover and Iman (1980) discuss the 
connection between nonparametric tests and data transformations. 


4.7 Exercises 


1. Normal approximation: suppose that y_1, ..., y_5 are independent samples from a Cauchy
distribution with unknown center θ and known scale 1: p(y_i|θ) ∝ 1/(1 + (y_i − θ)^2).
Assume that the prior distribution for θ is uniform on [0, 1]. Given the observations
(y_1, ..., y_5) = (−2, −1, 0, 1.5, 2.5):

(a) Determine the derivative and the second derivative of the log posterior density.

(b) Find the posterior mode of θ by iteratively solving the equation determined by setting
the derivative of the log-likelihood to zero.

(c) Construct the normal approximation based on the second derivative of the log posterior 
density at the mode. Plot the approximate normal density and compare to the exact 
density as computed using the approach described in Exercise 2.11. 


2. Normal approximation: derive the analytic form of the information matrix and the nor- 
mal approximation variance for the bioassay example. 


3. Normal approximation to the marginal posterior distribution of an estimand: in the
bioassay example, the normal approximation to the joint posterior distribution of (α, β)
is obtained. The posterior distribution of any estimand, such as the LD50, can be
approximated by a normal distribution fit to its marginal posterior mode and the curvature
of the marginal posterior density about the mode. This is sometimes called the ‘delta
method.’ Expand the posterior distribution of the LD50, −α/β, as a Taylor series around
the posterior mode and thereby derive the asymptotic posterior median and standard
deviation. Compare to the histogram in Figure 4.2.


4. Asymptotic normality: assuming the regularity conditions hold, we know that p(θ|y)
approaches normality as n → ∞. In addition, if φ = f(θ) is any one-to-one continuous
transformation of θ, we can express the Bayesian inference in terms of φ and find that
p(φ|y) also approaches normality. But a nonlinear transformation of a normal distribution
is no longer normal. How can both limiting normal distributions be valid?

5. Approximate mean and variance: 

(a) Suppose x and y are independent normally distributed random variables, where x has 
mean 4 and standard deviation 1, and y has mean 3 and standard deviation 2. What 
are the mean and standard deviation of y/x? Compute this using simulation. 

(b) Suppose x and y are independent random variables, where x has mean 4 and standard 
deviation 1, and y has mean 3 and standard deviation 2. What are the approximate 
mean and standard deviation of y/x? Determine this without using simulation. 

(c) What assumptions are required for the approximation in (b) to be reasonable? 

6. Statistical decision theory: a decision-theoretic approach to the estimation of an unknown
parameter θ introduces the loss function L(θ, a) which, loosely speaking, gives the cost of
deciding that the parameter has the value a, when it is in fact equal to θ. The estimate
a can be chosen to minimize the posterior expected loss,


$$\mathrm{E}(L(a \mid y)) = \int L(\theta, a)\, p(\theta \mid y)\, d\theta.$$


This optimal choice of a is called a Bayes estimate for the loss function L. Show that:

(a) If L(θ, a) = (θ − a)^2 (squared error loss), then the posterior mean, E(θ|y), if it exists,
is the unique Bayes estimate of θ.

(b) If L(θ, a) = |θ − a|, then any posterior median of θ is a Bayes estimate of θ.

(c) If k_0 and k_1 are nonnegative numbers, not both zero, and


$$L(\theta, a) = \begin{cases} k_0(\theta - a) & \text{if } \theta \geq a \\ k_1(a - \theta) & \text{if } \theta < a, \end{cases}$$


then any k_0/(k_0 + k_1) quantile of the posterior distribution p(θ|y) is a Bayes estimate of θ.


7. Unbiasedness: prove that the Bayesian posterior mean, based on a proper prior distri- 
bution, cannot be an unbiased estimator except in degenerate problems (see Bickel and 
Blackwell, 1967, and Lehmann, 1983, p. 244). 


8. Regression to the mean: work through the details of the example of mother’s and daugh- 
ter’s heights on page 94, illustrating with a sketch of the joint distribution and relevant 
conditional distributions. 


9. Point estimation: suppose a measurement y is recorded with a N(θ, σ^2) sampling
distribution, with σ known exactly and θ known to lie in the interval [0, 1]. Consider two
point estimates of θ: (1) the maximum likelihood estimate, restricted to the range [0, 1],
and (2) the posterior mean based on the assumption of a uniform prior distribution on
θ. Show that if σ is large enough, estimate (1) has a higher mean squared error than
(2) for any value of θ in [0, 1]. (The unrestricted maximum likelihood estimate has even
higher mean squared error.)


10. Non-Bayesian inference: replicate the analysis of the bioassay example in Section 3.7
using non-Bayesian inference. This problem does not have a unique answer, so be clear
on what methods you are using.

(a) Construct an ‘estimator’ of (α, β); that is, a function whose input is a dataset, (x, n, y),
and whose output is a point estimate (α̂, β̂). Compute the value of the estimate for
the data given in Table 3.1.

(b) The bias and variance of this estimate are functions of the true values of the parameters
(α, β) and also of the sampling distribution of the data, given α, β. Assuming the
binomial model, estimate the bias and variance of your estimator.

(c) Create approximate 95% confidence intervals for α, β, and the LD50 based on
asymptotic theory and the estimated bias and variance.

(d) Does the inaccuracy of the normal approximation for the posterior distribution
(compare Figures 3.3 and 4.1) cast doubt on the coverage properties of your confidence
intervals in (c)? If so, why?

(e) Create approximate 95% confidence intervals for α, β, and the LD50 using the
jackknife or bootstrap (see Efron and Tibshirani, 1993).

(f) Compare your 95% intervals for the LD50 in (c) and (e) to the posterior distribution
displayed in Figure 3.4 and the posterior distribution based on the normal approximation,
displayed in Figure 4.2b. Comment on the similarities and differences among the four
intervals. Which do you prefer as an inferential summary about the LD50? Why?


11. Bayesian interpretation of non-Bayesian estimates: consider the following estimation
procedure, which is based on classical hypothesis testing. A matched pairs experiment
is done, and the differences y_1, ..., y_n are recorded and modeled as independent draws
from N(θ, σ^2). For simplicity, assume σ^2 is known. The parameter θ is estimated as the
average observed difference if it is ‘statistically significant’ and zero otherwise:


$$\hat{\theta} = \begin{cases} \bar{y} & \text{if } \bar{y} \geq 1.96\,\sigma/\sqrt{n} \\ 0 & \text{otherwise.} \end{cases}$$


Can this be interpreted, in some sense, as an approximate summary (for example, a
posterior mean or mode) of a Bayesian inference under some prior distribution on θ?


12. Bayesian interpretation of non-Bayesian estimates: repeat the above problem but with
σ replaced by s, the sample standard deviation of y_1, ..., y_n.


13. Objections to Bayesian inference: discuss the criticism, ‘Bayesianism assumes: (a) Either
a weak or uniform prior [distribution], in which case why bother?, (b) Or a strong prior
[distribution], in which case why collect new data?, (c) Or more realistically, something
in between, in which case Bayesianism always seems to duck the issue’ (Ehrenberg, 1986).
Feel free to use any of the examples covered so far to illustrate your points.


14. Objectivity and subjectivity: discuss the statement, ‘People tend to believe results that
support their preconceptions and disbelieve results that surprise them. Bayesian methods
encourage this undisciplined mode of thinking.’


15. Coverage of posterior intervals:

(a) Consider a model with scalar parameter θ. Prove that, if you draw θ from the prior,
draw y|θ from the data model, then perform Bayesian inference for θ given y, that
there is a 50% probability that your 50% interval for θ contains the true value.

(b) Suppose θ ~ N(0, 2^2) and y|θ ~ N(θ, 1). Suppose the true value of θ is 1. What is the
coverage of the posterior 50% interval for θ? (You have to work this one out; it’s not
50% or any other number you could just guess.)

(c) Suppose θ ~ N(0, 2^2) and y|θ ~ N(θ, 1). Suppose the true value of θ is θ_0. Make a
plot showing the coverage of the posterior 50% interval for θ, as a function of θ_0.


This electronic edition is for non-commercial purposes only. 


Chapter 5 


Hierarchical models 


Many statistical applications involve multiple parameters that can be regarded as related 
or connected in some way by the structure of the problem, implying that a joint probability 
model for these parameters should reflect their dependence. For example, in a study of the 
effectiveness of cardiac treatments, with the patients in hospital j having survival probability
θ_j, it might be reasonable to expect that estimates of the θ_j’s, which represent a sample of
hospitals, should be related to each other. We shall see that this is achieved in a natural
way if we use a prior distribution in which the θ_j’s are viewed as a sample from a common
population distribution. A key feature of such applications is that the observed data, y_ij,
with units indexed by i within groups indexed by j, can be used to estimate aspects of
the population distribution of the θ_j’s even though the values of θ_j are not themselves
observed. It is natural to model such a problem hierarchically, with observable outcomes 
modeled conditionally on certain parameters, which themselves are given a probabilistic 
specification in terms of further parameters, known as hyperparameters. Such hierarchical 
thinking helps in understanding multiparameter problems and also plays an important role 
in developing computational strategies. 


Perhaps even more important in practice is that simple nonhierarchical models are usu- 
ally inappropriate for hierarchical data: with few parameters, they generally cannot fit large 
datasets accurately, whereas with many parameters, they tend to ‘overfit’ such data in the 
sense of producing models that fit the existing data well but lead to inferior predictions for 
new data. In contrast, hierarchical models can have enough parameters to fit the data well, 
while using a population distribution to structure some dependence into the parameters, 
thereby avoiding problems of overfitting. As we show in the examples in this chapter, it is 
often sensible to fit hierarchical models with more parameters than there are data points. 

In Section 5.1, we consider the problem of constructing a prior distribution using hierar- 
chical principles but without fitting a formal probability model for the hierarchical structure. 
We first consider the analysis of a single experiment, using historical data to create a prior 
distribution, and then we consider a plausible prior distribution for the parameters of a set 
of experiments. The treatment in Section 5.1 is not fully Bayesian, because, for the purpose 
of simplicity in exposition, we work with a point estimate, rather than a complete joint 
posterior distribution, for the parameters of the population distribution (the hyperparam- 
eters). In Section 5.2, we discuss how to construct a hierarchical prior distribution in the 
context of a fully Bayesian analysis. Sections 5.3-5.4 present a general approach to compu- 
tation with hierarchical models in conjugate families by combining analytical and numerical 
methods. We defer details of the most general computational methods to Part III in order 
to explore immediately the important practical and conceptual advantages of hierarchical 
Bayesian models. The chapter continues with two extended examples: a hierarchical model 
for an educational testing experiment and a Bayesian treatment of the method of ‘meta- 
analysis’ as used in medical research to combine the results of separate studies relating to 
the same research question. We conclude with a discussion of weakly informative priors, 
which become important for hierarchical models fit to data from a small number of groups. 


Previous experiments: 

0/20 0/20 0/20 0/20 0/20 0/20 0/20 0/19 0/19 0/19 

0/19 0/18 0/18 0/17 1/20 1/20 1/20 1/20 1/19 1/19 

1/18 1/18 2/25 2/24 2/23 2/20 2/20 2/20 2/20 2/20 

2/20 1/10 5/49 2/19 5/46 3/27 2/17 7/49 7/47 3/20 

3/20 2/13 9/48 10/50 4/20 4/20 4/20 4/20 4/20 4/20 

4/20 10/48 4/19 4/19 4/19 5/22 11/46 12/49 5/20 5/20 

6/23 5/19 6/22 6/20 6/20 6/20 16/52 15/47 15/46 9/24 


Current experiment: 
4/14 


Table 5.1 Tumor incidence in historical control groups and current group of rats, from Tarone
(1982). The table displays the values of y_j/n_j: (number of rats with tumors)/(total number of rats).


5.1 Constructing a parameterized prior distribution


Analyzing a single experiment in the context of historical data


To begin our description of hierarchical models, we consider the problem of estimating a
parameter θ using data from a small experiment and a prior distribution constructed from
similar previous (or historical) experiments. Mathematically, we will consider the current 
and historical experiments to be a random sample from a common population. 


Example. Estimating the risk of tumor in a group of rats 

In the evaluation of drugs for possible clinical application, studies are routinely
performed on rodents. For a particular study drawn from the statistical literature,
suppose the immediate aim is to estimate θ, the probability of tumor in a population of
female laboratory rats of type ‘F344’ that receive a zero dose of the drug (a control
group). The data show that 4 out of 14 rats developed endometrial stromal polyps (a
kind of tumor). It is natural to assume a binomial model for the number of tumors,
given θ. For convenience, we select a prior distribution for θ from the conjugate family,
θ ~ Beta(α, β).


Analysis with a fixed prior distribution. From historical data, suppose we knew that
the tumor probabilities θ among groups of female lab rats of type F344 follow an
approximate beta distribution, with known mean and standard deviation. The tumor
probabilities θ vary because of differences in rats and experimental conditions among
the experiments. Referring to the expressions for the mean and variance of the beta
distribution (see Appendix A), we could find values for α, β that correspond to the
given values for the mean and standard deviation. Then, assuming a Beta(α, β) prior
distribution for θ yields a Beta(α + 4, β + 10) posterior distribution for θ.


Approximate estimate of the population distribution using the historical data. Typically,
the mean and standard deviation of underlying tumor risks are not available.
Rather, historical data are available on previous experiments on similar groups of rats.
In the rat tumor example, the historical data were in fact a set of observations of tumor
incidence in 70 groups of rats (Table 5.1). In the jth historical experiment, let the
number of rats with tumors be y_j and the total number of rats be n_j. We model the
y_j’s as independent binomial data, given sample sizes n_j and study-specific means θ_j.
Assuming that the beta prior distribution with parameters (α, β) is a good description
of the population distribution of the θ_j’s in the historical experiments, we can display
the hierarchical model schematically as in Figure 5.1, with θ_71 and y_71 corresponding
to the current experiment.

                   α, β
 θ_1   θ_2   θ_3   ...   θ_70   θ_71
 y_1   y_2   y_3   ...   y_70   y_71

Figure 5.1: Structure of the hierarchical model for the rat tumor example.

The observed sample mean and standard deviation of the 70 values y_j/n_j are 0.136 and
0.103. If we set the mean and standard deviation of the population distribution to
these values, we can solve for α and β; see (A.3) on page 585 in Appendix A. The
resulting estimate for (α, β) is (1.4, 8.6). This is not a Bayesian calculation because
it is not based on any specified full probability model. We present a better, fully
Bayesian approach to estimating (α, β) for this example in Section 5.3. The estimate
(1.4, 8.6) is simply a starting point from which we can explore the idea of estimating
the parameters of the population distribution.

Using the simple estimate of the historical population distribution as a prior distribution
for the current experiment yields a Beta(5.4, 18.6) posterior distribution for θ_71:
the posterior mean is 0.223, and the standard deviation is 0.083. The prior informa- 
tion has resulted in a posterior mean substantially lower than the crude proportion, 
4/14 = 0.286, because the weight of experience indicates that the number of tumors 
in the current experiment is unusually high. 
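
The moment-matching step and the conjugate update can be reproduced in a few lines. The sketch below is illustrative only: it assumes the sample standard deviation uses the n − 1 divisor, and the variable names are ours rather than the book's. It matches the mean and variance of a beta distribution to the 70 historical proportions from Table 5.1 and then updates with the current data, 4 tumors in 14 rats.

```python
import numpy as np

# Historical control groups from Table 5.1: tumors y_j out of n_j rats, j = 1, ..., 70.
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
              1, 1, 2, 2, 2, 2, 2, 2, 2, 2,  2, 1, 5, 2, 5, 3, 2, 7, 7, 3,
              3, 2, 9, 10, 4, 4, 4, 4, 4, 4,  4, 10, 4, 4, 4, 5, 11, 12, 5, 5,
              6, 5, 6, 6, 6, 6, 16, 15, 15, 9])
n = np.array([20, 20, 20, 20, 20, 20, 20, 19, 19, 19,  19, 18, 18, 17, 20, 20, 20, 20, 19, 19,
              18, 18, 25, 24, 23, 20, 20, 20, 20, 20,  20, 10, 49, 19, 46, 27, 17, 49, 47, 20,
              20, 13, 48, 50, 20, 20, 20, 20, 20, 20,  20, 48, 19, 19, 19, 22, 46, 49, 20, 20,
              23, 19, 22, 20, 20, 20, 52, 47, 46, 24])

rates = y / n
m = rates.mean()
s = rates.std(ddof=1)          # assumption: n - 1 divisor for the sample sd

# Beta(alpha, beta): mean = alpha/(alpha+beta), var = mean(1-mean)/(alpha+beta+1).
total = m * (1 - m) / s**2 - 1                     # alpha + beta
alpha_hat, beta_hat = m * total, (1 - m) * total

# Conjugate update with the current experiment (4 tumors out of 14 rats).
y_cur, n_cur = 4, 14
a_post = alpha_hat + y_cur
b_post = beta_hat + n_cur - y_cur
post_mean = a_post / (a_post + b_post)
post_sd = np.sqrt(a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1)))

print(round(alpha_hat, 1), round(beta_hat, 1), round(post_mean, 3), round(post_sd, 3))
```

Because the update uses the unrounded estimates of (α, β), the posterior mean can differ in the third decimal from what the rounded values (1.4, 8.6) would give.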

These analyses require that the current tumor risk, θ_71, and the 70 historical tumor
risks, θ_1,...,θ_70, be considered a random sample from a common distribution, an
assumption that would be invalidated, for example, if it were known that the historical
experiments were all done in laboratory A but the current data were gathered in
laboratory B, or if time trends were relevant. In practice, a simple, although arbitrary,
way of accounting for differences between the current and historical data is to inflate
the historical variance. For the beta model, inflating the historical variance means
decreasing (α+β) while holding α/β constant. Other systematic differences, such as a
time trend in tumor risks, can be incorporated in a more extensive model.
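
To make the variance-inflation idea concrete, the following sketch scales α + β by a hypothetical factor while keeping the ratio α/β (and hence the prior mean) fixed. The function names and the factor 0.5 are illustrative assumptions, not part of the original analysis.

```python
def inflate_beta_variance(alpha, beta, shrink):
    """Multiply alpha + beta by `shrink` (0 < shrink < 1) while keeping alpha/beta fixed."""
    return alpha * shrink, beta * shrink


def beta_mean_var(alpha, beta):
    """Mean and variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    return mean, mean * (1 - mean) / (total + 1)


a0, b0 = 1.4, 8.6                                  # crude historical estimate
a1, b1 = inflate_beta_variance(a0, b0, 0.5)        # hypothetical: halve the prior 'sample size'

mean0, var0 = beta_mean_var(a0, b0)
mean1, var1 = beta_mean_var(a1, b1)
print(mean0, var0)
print(mean1, var1)
```

The prior mean is unchanged, while the variance grows because the prior 'sample size' α + β has been reduced.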


Having used the 70 historical experiments to form a prior distribution for θ_71, we might
now like also to use this same prior distribution to obtain Bayesian inferences for the tumor
probabilities in the first 70 experiments, θ_1,...,θ_70. There are several logical and practical
problems with the approach of directly estimating a prior distribution from existing data: 


• If we wanted to use the estimated prior distribution for inference about the first 70
experiments, then the data would be used twice: first, all the results together are used to
estimate the prior distribution, and then each experiment’s results are used to estimate
its θ. This would seem to cause us to overestimate our precision.

• The point estimate for α and β seems arbitrary, and using any point estimate for α and
β necessarily ignores some posterior uncertainty.

• We can also make the opposite point: does it make sense to ‘estimate’ α and β at all?
They are part of the ‘prior’ distribution: should they be known before the data are
gathered, according to the logic of Bayesian inference?


Logic of combining information 


Despite these problems, it clearly makes more sense to try to estimate the population
distribution from all the data, and thereby to help estimate each θ_j, than to estimate all 71
values θ_j separately. Consider the following thought experiment about inference on two of
the parameters, θ_26 and θ_27, each corresponding to experiments with 2 observed tumors out
of 20 rats. Suppose our prior distribution for both θ_26 and θ_27 is centered around 0.15; now
suppose that you were told after completing the data analysis that θ_26 = 0.1 exactly. This
should influence your estimate of θ_27; in fact, it would probably make you think that θ_27
is lower than you previously believed, since the data for the two parameters are identical,
and the postulated value of 0.1 is lower than you previously expected for θ_26 from the prior
distribution. Thus, θ_26 and θ_27 should be dependent in the posterior distribution, and they
should not be analyzed separately.

We retain the advantages of using the data to estimate prior parameters and eliminate 
all of the disadvantages just mentioned by putting a probability model on the entire set of 
parameters and experiments and then performing a Bayesian analysis on the joint distribu- 
tion of all the model parameters. A complete Bayesian analysis is described in Section 5.3. 
The analysis using the data to estimate the prior parameters, which is sometimes called 
empirical Bayes, can be viewed as an approximation to the complete hierarchical Bayesian 
analysis. We prefer to avoid the term ‘empirical Bayes’ because it misleadingly suggests 
that the full Bayesian method, which we discuss here and use for the rest of the book, is 
not ‘empirical.’ 


5.2 Exchangeability and hierarchical models 


Generalizing from the example of the previous section, consider a set of experiments j =
1,...,J, in which experiment j has data (vector) y_j and parameter (vector) θ_j, with
likelihood p(y_j|θ_j). (Throughout this chapter we use the word ‘experiment’ for convenience,
but the methods can apply equally well to nonexperimental data.) Some of the parameters
in different experiments may overlap; for example, each data vector y_j may be a sample of
observations from a normal distribution with mean μ_j and common variance σ², in which
case θ_j = (μ_j, σ²). In order to create a joint probability model for all the parameters θ, we
use the crucial idea of exchangeability introduced in Chapter 1 and used repeatedly since
then.


Exchangeability 


If no information (other than the data y) is available to distinguish any of the θ_j’s from any
of the others, and no ordering or grouping of the parameters can be made, one must assume
symmetry among the parameters in their prior distribution. This symmetry is represented
probabilistically by exchangeability; the parameters (θ_1,...,θ_J) are exchangeable in their
joint distribution if p(θ_1,...,θ_J) is invariant to permutations of the indexes (1,...,J). For
example, in the rat tumor problem, suppose we have no information to distinguish the 71
experiments, other than the sample sizes n_j, which presumably are not related to the values
of θ_j; we therefore use an exchangeable model for the θ_j’s.

We have already encountered the concept of exchangeability in constructing independent 
and identically distributed models for direct data. In practice, ignorance implies
exchangeability. Generally, the less we know about a problem, the more confidently we can make
claims of exchangeability. (This is not, we hasten to add, a good reason to limit our
knowledge of a problem before embarking on statistical analysis!) Consider the analogy to a
roll of a die: we should initially assign equal probabilities to all six outcomes, but if we 
study the measurements of the die and weigh the die carefully, we might eventually notice 
imperfections, which might make us favor one outcome over the others and thus eliminate 
the symmetry among the six outcomes. 

The simplest form of an exchangeable distribution has each of the parameters θ_j as an
independent sample from a prior (or population) distribution governed by some unknown
parameter vector φ; thus,

$$ p(\theta \mid \phi) = \prod_{j=1}^{J} p(\theta_j \mid \phi). \tag{5.1} $$

In general, φ is unknown, so our distribution for θ must average over our uncertainty in φ:

$$ p(\theta) = \int \left( \prod_{j=1}^{J} p(\theta_j \mid \phi) \right) p(\phi)\, d\phi. \tag{5.2} $$


This form, the mixture of independent identical distributions, is usually all that we need to 
capture exchangeability in practice. 
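
A small simulation can illustrate what the mixture form (5.2) implies. The sketch below assumes, purely for illustration, a scalar hyperparameter φ ~ N(0, 1) and θ_j | φ ~ N(φ, 1); neither choice comes from the text. Conditional on φ the θ_j are independent, but once φ is averaged over they become positively correlated (here the theoretical correlation is 0.5), while remaining exchangeable.

```python
import numpy as np

rng = np.random.default_rng(0)
L, J = 100_000, 3            # number of simulation draws, number of parameters

# Illustrative assumptions: phi ~ N(0, 1), and theta_j | phi ~ N(phi, 1) independently.
phi = rng.normal(0.0, 1.0, size=L)
theta = rng.normal(phi[:, None], 1.0, size=(L, J))

# Marginally (phi integrated out by simulation) the theta_j share the same
# distribution and the same pairwise correlation, as exchangeability requires.
corr = np.corrcoef(theta, rowvar=False)
print(np.round(corr, 2))
```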

A related theoretical result, de Finetti’s theorem, to which we alluded in Section 1.2,
states that in the limit as J → ∞, any suitably well-behaved exchangeable distribution
on (θ_1,...,θ_J) can be expressed as a mixture of independent and identical distributions
as in (5.2). The theorem does not hold when J is finite (see Exercises 5.1, 5.2, and 5.4).
Statistically, the mixture model characterizes parameters θ as drawn from a common
‘superpopulation’ that is determined by the unknown hyperparameters, φ. We are already
familiar with exchangeable models for data, y_1,...,y_n, in the form of likelihoods in which
the n observations are independent and identically distributed, given some parameter vector
θ.

As a simple counterexample to the above mixture model, consider the probabilities of a 
given die landing on each of its six faces. The probabilities θ_1,...,θ_6 are exchangeable, but
the six parameters θ_j are constrained to sum to 1 and so cannot be modeled with a mixture
of independent identical distributions; nonetheless, they can be modeled exchangeably. 
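
One way to see the die counterexample numerically is to pick a specific exchangeable distribution on the six probabilities. The sketch below assumes a symmetric Dirichlet(1,...,1) distribution, which is our illustrative choice and not something specified in the text. The draws sum to 1 and are negatively correlated (theoretically −1/5 for each pair), whereas any mixture of independent identical distributions has nonnegative pairwise covariance, since Cov(θ_i, θ_j) = Var(E(θ_j|φ)) ≥ 0.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative assumption: a symmetric Dirichlet on the six face probabilities.
theta = rng.dirichlet(np.ones(6), size=100_000)

row_sums = theta.sum(axis=1)                 # each draw sums to 1
corr = np.corrcoef(theta, rowvar=False)      # pairwise correlations between faces
print(np.round(corr[0, 1], 2))
```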


Example. Exchangeability and sampling 

The following thought experiment illustrates the role of exchangeability in inference 
from random sampling. For simplicity, we use a nonhierarchical example with
exchangeability at the level of y rather than θ.

We, the authors, have selected eight states out of the United States and recorded the 
divorce rate per 1000 population in each state in 1981. Call these y_1,...,y_8. What
can you, the reader, say about y_8, the divorce rate in the eighth state?

Since you have no information to distinguish any of the eight states from the others, 
you must model them exchangeably. You might use a beta distribution for the eight 
y_j’s, a logit normal, or some other prior distribution restricted to the range [0,1].
Unless you are familiar with divorce statistics in the United States, your distribution
on (y_1,...,y_8) should be fairly vague.

We now randomly sample seven states from these eight and tell you their divorce 
rates: 5.8, 6.6, 7.8, 5.6, 7.0, 7.1, 5.4, each in numbers of divorces per 1000 population 
(per year). Based primarily on the data, a reasonable posterior (predictive)
distribution for the remaining value, y_8, would probably be centered around 6.5 and have
most of its mass between 5.0 and 8.0. Changing the indexing does not change the
joint distribution: if we relabel the remaining value to be any other y_j, the posterior
estimate would be the same. The y_j are exchangeable but they are not independent, since we
assume that the divorce rate in the eighth unobserved state is probably similar to the
observed rates.

Suppose initially we had given you the further prior information that the eight states 
are Mountain states: Arizona, Colorado, Idaho, Montana, Nevada, New Mexico, Utah, 
and Wyoming, but selected in a random order; you still are not told which observed 
rate corresponds to which state. Now, before the seven data points were observed, 
the eight divorce rates should still be modeled exchangeably. However, your prior 
distribution (that is, before seeing the data), for the eight numbers should change: 
it seems reasonable to assume that Utah, with its large Mormon population, has a 
much lower divorce rate, and Nevada, with its liberal divorce laws, has a much higher 
divorce rate, than the remaining six states. Perhaps, given your expectation of outliers 
in the distribution, your prior distribution should have wide tails. Given this extra 
information (the names of the eight states), when you see the seven observed values 
and note that the numbers are so close together, it might seem a reasonable guess that 
the missing eighth state is Nevada or Utah. Therefore its value might be expected to 
be much lower or much higher than the seven values observed. This might lead to a 
bimodal or trimodal posterior distribution to account for the two plausible scenarios. 
The prior distribution on the eight values y_j is still exchangeable, however, because
you have no information telling which state corresponds to which index number. (See 
Exercise 5.6.) 

Finally, we tell you that the state not sampled (corresponding to y_8) was Nevada.
Now, even before seeing the seven observed values, you cannot assign an exchangeable
prior distribution to the set of eight divorce rates, since you have information that
distinguishes y_8 from the other seven numbers, here suspecting it is larger than any
of the others. Once y_1,...,y_7 have been observed, a reasonable posterior distribution
for y_8 plausibly should have most of its mass above the largest observed rate, that is,
p(y_8 > max(y_1,...,y_7) | y_1,...,y_7) should be large.

Incidentally, Nevada’s divorce rate in 1981 was 13.9 per 1000 population. 


Exchangeability when additional information is available on the units 


Often observations are not fully exchangeable, but are partially or conditionally exchange- 
able: 


• If observations can be grouped, we may make a hierarchical model, where each group has its
own submodel, but the group properties are unknown. If we assume that group properties
are exchangeable, we can use a common prior distribution for the group properties.

• If y_i has additional information x_i so that y_i are not exchangeable but (y_i, x_i) still are
exchangeable, then we can make a joint model for (y_i, x_i) or a conditional model for
y_i|x_i.

In the rat tumor example, y_j were exchangeable as no additional knowledge was available
on experimental conditions. If we knew that specific batches of experiments were made in
different laboratories we could assume partial exchangeability and use a two-level hierarchical
model to model variation within each laboratory and between laboratories.

In the divorce example, if we knew x_j, the divorce rate in state j last year, for j =
1,...,8, but not which index corresponded to which state, then we would certainly be able
to distinguish the eight values of y_j, but the joint prior distribution p(x_j, y_j) would be the
same for each state. For states having the same last-year divorce rates x_j, we could use
grouping and assume partial exchangeability, or if there are many possible values for x_j (as
we would assume for divorce rates) we could assume conditional exchangeability and use x_j
as a covariate in a regression model.

In general, the usual way to model exchangeability with covariates is through
conditional independence:

$$ p(\theta_1,\dots,\theta_J \mid x_1,\dots,x_J) = \int \left[ \prod_{j=1}^{J} p(\theta_j \mid \phi, x_j) \right] p(\phi \mid x)\, d\phi, $$

with x = (x_1,...,x_J). In this way, exchangeable models become almost universally applicable,
because any information available to distinguish different units should be encoded in the x
and y variables.

In the rat tumor example, we have already noted that the sample sizes n_j are the only
available information to distinguish the different experiments. It does not seem likely that
n_j would be a useful variable for modeling tumor rates, but if one were interested, one
could create an exchangeable model for the J pairs (n, y)_j. A natural first step would be
to plot y_j/n_j vs. n_j to see any obvious relation that could be modeled. For example, perhaps
some studies j had larger sample sizes n_j because the investigators correctly suspected rarer
events; that is, smaller θ_j and thus smaller expected values of y_j/n_j. In fact, the plot of y_j/n_j
versus n_j, not shown here, shows no apparent relation between the two variables.


Objections to exchangeable models 


In virtually any statistical application, it is natural to object to exchangeability on the 
grounds that the units actually differ. For example, the 71 rat tumor experiments were 
performed at different times, on different rats, and presumably in different laboratories. 
Such information does not, however, invalidate exchangeability. That the experiments differ 
implies that the θ_j’s differ, but it might be perfectly acceptable to consider them as if drawn
from a common distribution. In fact, with no information available to distinguish them, we
have no logical choice but to model the θ_j’s exchangeably. Objecting to exchangeability for
modeling ignorance is no more reasonable than objecting to an independent and identically 
distributed model for samples from a common population, objecting to regression models 
in general, or, for that matter, objecting to displaying points in a scatterplot without 
individual labels. As with regression, the valid concern is not about exchangeability, but 
about encoding relevant knowledge as explanatory variables where possible. 


The full Bayesian treatment of the hierarchical model 


Returning to the problem of inference, the key ‘hierarchical’ part of these models is that
φ is not known and thus has its own prior distribution, p(φ). The appropriate Bayesian
posterior distribution is of the vector (φ, θ). The joint prior distribution is

$$ p(\phi, \theta) = p(\phi)\, p(\theta \mid \phi), $$

and the joint posterior distribution is

$$
\begin{aligned}
p(\phi, \theta \mid y) &\propto p(\phi, \theta)\, p(y \mid \phi, \theta) \\
&= p(\phi, \theta)\, p(y \mid \theta),
\end{aligned} \tag{5.3}
$$

with the latter simplification holding because the data distribution, p(y|φ, θ), depends only
on θ; the hyperparameters φ affect y only through θ. Previously, we assumed φ was known,
which is unrealistic; now we include the uncertainty in φ in the model.


The hyperprior distribution 


In order to create a joint probability distribution for (φ, θ), we must assign a prior
distribution to φ. If little is known about φ, we can assign a diffuse prior distribution, but we
must be careful when using an improper prior density to check that the resulting
posterior distribution is proper, and we should assess whether our conclusions are sensitive to
this simplifying assumption. In most real problems, one should have enough substantive
knowledge about the parameters in φ at least to constrain the hyperparameters into a finite
region, if not to assign a substantive hyperprior distribution. As in nonhierarchical models,
it is often practical to start with a simple, relatively noninformative, prior distribution on φ
and seek to add more prior information if there remains too much variation in the posterior
distribution.

In the rat tumor example, the hyperparameters are (α, β), which determine the beta
distribution for θ. We illustrate one approach to constructing an appropriate hyperprior
distribution in the continuation of that example in the next section. 


Posterior predictive distributions 


Hierarchical models are characterized both by hyperparameters, φ, in our notation, and
parameters θ. There are two posterior predictive distributions that might be of interest to
the data analyst: (1) the distribution of future observations ỹ corresponding to an existing
θ_j, or (2) the distribution of observations ỹ corresponding to future θ_j’s drawn from the
same superpopulation. We label the future θ_j’s as θ̃. Both kinds of replications can be used
to assess model adequacy, as we discuss in Chapter 6. In the rat tumor example, future
observations can be (1) additional rats from an existing experiment, or (2) results from a
future experiment. In the former case, the posterior predictive draws ỹ are based on the
posterior draws of θ_j for the existing experiment. In the latter case, one must first draw
θ̃ for the new experiment from the population distribution, given the posterior draws of φ,
and then draw ỹ given the simulated θ̃.
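
The two kinds of predictive draws can be sketched for the beta-binomial setting of the rat tumor example. Everything numerical here is a stand-in: the 'posterior draws' of (α, β) are hypothetical values generated only so the example runs, and `n_new` is an assumed future group size. In a real analysis the hyperparameter draws would come from p(φ|y).

```python
import numpy as np

rng = np.random.default_rng(2)
L = 1000

# Hypothetical stand-ins for L posterior draws of the hyperparameters (alpha, beta).
alpha = rng.uniform(2.0, 3.0, size=L)
beta = rng.uniform(12.0, 16.0, size=L)

y_j, n_j = 4, 14     # data of one existing experiment
n_new = 20           # assumed number of rats in the future data

# (1) Additional rats from the existing experiment:
#     draw theta_j from its conditional posterior, then new data given theta_j.
theta_j = rng.beta(alpha + y_j, beta + n_j - y_j)
y_rep_existing = rng.binomial(n_new, theta_j)

# (2) A future experiment:
#     first draw a new theta from the population distribution, then data given it.
theta_new = rng.beta(alpha, beta)
y_rep_new = rng.binomial(n_new, theta_new)

print(y_rep_existing.mean(), y_rep_new.mean())
```

Case (2) carries the extra population-level variation in θ̃, so its predictive distribution is typically more spread out than case (1).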


5.3 Bayesian analysis of conjugate hierarchical models 


Our inferential strategy for hierarchical models follows the general approach to multiparam- 
eter problems presented in Section 3.8 but is more difficult in practice because of the large 
number of parameters that commonly appear in a hierarchical model. In particular, we 
cannot generally plot the contours or display a scatterplot of the simulations from the joint 
posterior distribution of (θ, φ). With care, however, we can follow a similar simulation-based
approach as before. 

In this section, we present an approach that combines analytical and numerical methods 
to obtain simulations from the joint posterior distribution, p(θ, φ|y), for the beta-binomial
model for the rat-tumor example, for which the population distribution, p(θ|φ), is conjugate
to the likelihood, p(y|θ). For the many nonconjugate hierarchical models that arise in
practice, more advanced computational methods, presented in Part III of this book, are 
necessary. Even for more complicated problems, however, the approach using conjugate 
distributions is useful for obtaining approximate estimates and starting points for more 
accurate computations. 


Analytic derivation of conditional and marginal distributions 


We first perform the following three steps analytically.

1. Write the joint posterior density, p(θ, φ|y), in unnormalized form as a product of the
hyperprior distribution p(φ), the population distribution p(θ|φ), and the likelihood p(y|θ).

2. Determine analytically the conditional posterior density of θ given the hyperparameters
φ; for fixed observed y, this is a function of φ, p(θ|φ, y).

3. Estimate φ using the Bayesian paradigm; that is, obtain its marginal posterior
distribution, p(φ|y).

The first step is immediate, and the second step is easy for conjugate models because,
conditional on φ, the population distribution for θ is just the independent and identically
distributed model (5.1), so that the conditional posterior density is a product of conjugate
posterior densities for the components θ_j.

The third step can be performed by brute force by integrating the joint posterior
distribution over θ:

$$ p(\phi \mid y) = \int p(\theta, \phi \mid y)\, d\theta. \tag{5.4} $$

For many standard models, however, including the normal distribution, the marginal
posterior distribution of φ can be computed algebraically using the conditional probability
formula,

$$ p(\phi \mid y) = \frac{p(\theta, \phi \mid y)}{p(\theta \mid \phi, y)}. \tag{5.5} $$

This expression is useful because the numerator is just the joint posterior distribution (5.3),
and the denominator is the posterior distribution for θ if φ were known. The difficulty in
using (5.5), beyond a few standard conjugate models, is that the denominator, p(θ|φ, y),
regarded as a function of both θ and φ for fixed y, has a normalizing factor that depends on
φ as well as y. One must be careful with the proportionality ‘constant’ in Bayes’ theorem,
especially when using hierarchical models, to make sure it is actually constant. Exercise
5.11 has an example of a nonconjugate model in which the integral (5.4) has no closed-form
solution so that (5.5) is no help.
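
The key property of (5.5) is that the ratio on the right-hand side does not depend on θ, provided the denominator is properly normalized. The sketch below checks this numerically for a single beta-binomial experiment, leaving out the hyperprior factor p(φ), which does not involve θ. The function names and the test values of (α, β, y, n) are illustrative assumptions.

```python
from math import lgamma, log


def log_beta_fn(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_joint(theta, alpha, beta, y, n):
    """log[p(theta | alpha, beta) p(y | theta)], dropping the binomial coefficient."""
    return ((alpha - 1) * log(theta) + (beta - 1) * log(1 - theta)
            - log_beta_fn(alpha, beta)
            + y * log(theta) + (n - y) * log(1 - theta))


def log_conditional(theta, alpha, beta, y, n):
    """Log of the normalized conditional posterior, Beta(alpha + y, beta + n - y)."""
    a, b = alpha + y, beta + n - y
    return (a - 1) * log(theta) + (b - 1) * log(1 - theta) - log_beta_fn(a, b)


alpha, beta, y, n = 2.0, 10.0, 4, 14          # arbitrary illustrative values
ratios = [log_joint(t, alpha, beta, y, n) - log_conditional(t, alpha, beta, y, n)
          for t in (0.1, 0.3, 0.6, 0.9)]

# The log ratio should equal log B(alpha+y, beta+n-y) - log B(alpha, beta) for every theta.
expected = log_beta_fn(alpha + y, beta + n - y) - log_beta_fn(alpha, beta)
print(ratios, expected)
```

Note that the normalizing factor of the conditional density depends on (α, β); dropping it as a 'constant' would give the wrong marginal posterior.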


Drawing simulations from the posterior distribution 


The following strategy is useful for simulating a draw from the joint posterior distribution,
p(θ, φ|y), for simple hierarchical models such as are considered in this chapter.

1. Draw the vector of hyperparameters, φ, from its marginal posterior distribution, p(φ|y).
If φ is low-dimensional, the methods discussed in Chapter 3 can be used; for
high-dimensional φ, more sophisticated methods such as described in Part III may be needed.

2. Draw the parameter vector θ from its conditional posterior distribution, p(θ|φ, y), given
the drawn value of φ. For the examples we consider in this chapter, the factorization
p(θ|φ, y) = ∏_j p(θ_j|φ, y) holds, and so the components θ_j can be drawn independently,
one at a time.

3. If desired, draw predictive values ỹ from the posterior predictive distribution given the
drawn θ. Depending on the problem, it might be necessary first to draw a new value θ̃,
given φ, as discussed at the end of the previous section.

As usual, the above steps are performed L times in order to obtain a set of L draws. From
the joint posterior simulations of θ and ỹ, we can compute the posterior distribution of any
estimand or predictive quantity of interest.


Application to the model for rat tumors 


We now perform a full Bayesian analysis of the rat tumor experiments described in Section 
5.1. Once again, the data from experiments j = 1,...,J, J = 71, are assumed to follow 
independent binomial distributions: 


$$ y_j \sim \mathrm{Bin}(n_j, \theta_j), $$

with the number of rats, n_j, known. The parameters θ_j are assumed to be independent
samples from a beta distribution:

$$ \theta_j \sim \mathrm{Beta}(\alpha, \beta), $$

and we shall assign a noninformative hyperprior distribution to reflect our ignorance about
the unknown hyperparameters. As usual, the word ‘noninformative’ indicates our attitude
toward this part of the model and is not intended to imply that this particular distribution
has any special properties. If the hyperprior distribution turns out to be crucial for our
inference, we should report this and if possible seek further substantive knowledge that
could be used to construct a more informative prior distribution. If we wish to assign an
improper prior distribution for the hyperparameters, (α, β), we must check that the
posterior distribution is proper. We defer the choice of noninformative hyperprior distribution,
a relatively arbitrary and unimportant part of this particular analysis, until we inspect the 
integrability of the posterior density. 


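Joint, conditional, and marginal posterior distributions. We first perform the three steps
for determining the analytic form of the posterior distribution. The joint posterior
distribution of all parameters is

$$
\begin{aligned}
p(\theta, \alpha, \beta \mid y) &\propto p(\alpha, \beta)\, p(\theta \mid \alpha, \beta)\, p(y \mid \theta, \alpha, \beta) \\
&\propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta_j^{\alpha-1} (1-\theta_j)^{\beta-1} \prod_{j=1}^{J} \theta_j^{y_j} (1-\theta_j)^{n_j-y_j}.
\end{aligned} \tag{5.6}
$$

Given (α, β), the components of θ have independent posterior densities that are of the form
θ_j^A (1 − θ_j)^B (that is, beta densities), and the joint density is

$$ p(\theta \mid \alpha, \beta, y) = \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta+n_j)}{\Gamma(\alpha+y_j)\,\Gamma(\beta+n_j-y_j)}\, \theta_j^{\alpha+y_j-1} (1-\theta_j)^{\beta+n_j-y_j-1}. \tag{5.7} $$

We can determine the marginal posterior distribution of (α, β) by substituting (5.6) and
(5.7) into the conditional probability formula (5.5):

$$ p(\alpha, \beta \mid y) \propto p(\alpha, \beta) \prod_{j=1}^{J} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \frac{\Gamma(\alpha+y_j)\,\Gamma(\beta+n_j-y_j)}{\Gamma(\alpha+\beta+n_j)}. \tag{5.8} $$

The product in equation (5.8) cannot be simplified analytically but is easy to compute for
any specified values of (α, β) using a standard routine to compute the gamma function.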
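
As a minimal sketch of how (5.8) might be evaluated, the function below works on the log scale with the log-gamma function from the Python standard library, which avoids overflow in the gamma ratios. The function name and the `log_prior` argument are our own conventions; the flat placeholder prior is used only to exercise the product term and is not a recommended hyperprior.

```python
from math import lgamma, log


def log_marginal_posterior(alpha, beta, y, n, log_prior):
    """Log of the right-hand side of (5.8), up to an additive constant."""
    total = log_prior(alpha, beta)
    for yj, nj in zip(y, n):
        total += (lgamma(alpha + beta) - lgamma(alpha) - lgamma(beta)
                  + lgamma(alpha + yj) + lgamma(beta + nj - yj)
                  - lgamma(alpha + beta + nj))
    return total


def flat_log_prior(alpha, beta):
    """Placeholder prior, used here only to isolate the product in (5.8)."""
    return 0.0


# Sanity check: for one experiment with y = 1, n = 1 the product term reduces to
# alpha / (alpha + beta), the prior predictive probability of a tumor.
single = log_marginal_posterior(2.0, 6.0, [1], [1], flat_log_prior)
print(single, log(2.0 / 8.0))

# Example call on the first few experiments of Table 5.1 plus the current one.
value = log_marginal_posterior(1.4, 8.6, [0, 0, 1, 2, 4], [20, 19, 20, 25, 14], flat_log_prior)
print(value)
```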


Choosing a standard parameterization and setting up a ‘noninformative’ hyperprior
distribution. Because we have no immediately available information about the distribution
of tumor rates in populations of rats, we seek a relatively diffuse hyperprior
distribution for (α, β). Before assigning a hyperprior distribution, we reparameterize in terms
of logit(α/(α+β)) = log(α/β) and log(α+β), which are the logit of the mean and the logarithm
of the ‘sample size’ in the beta population distribution for θ. It would seem reasonable to
assign independent hyperprior distributions to the prior mean and ‘sample size,’ and we
use the logistic and logarithmic transformations to put each on a (−∞, ∞) scale.
Unfortunately, a uniform prior density on these newly transformed parameters yields an improper
posterior density, with an infinite integral in the limit (α+β) → ∞, and so this particular
prior density cannot be used here.

In a problem such as this with a reasonably large amount of data, it is possible to set up a
‘noninformative’ hyperprior density that is dominated by the likelihood and yields a proper
posterior distribution. One reasonable choice of diffuse hyperprior density is uniform on
(α/(α+β), (α+β)^(−1/2)), which when multiplied by the appropriate Jacobian yields the following
densities on the original scale,

$$ p(\alpha, \beta) \propto (\alpha+\beta)^{-5/2}, \tag{5.9} $$

and on the natural transformed scale:

$$ p\!\left(\log\frac{\alpha}{\beta},\, \log(\alpha+\beta)\right) \propto \alpha\beta\,(\alpha+\beta)^{-5/2}. \tag{5.10} $$

See Exercise 5.9 for a discussion of this prior density.

Figure 5.2 First try at a contour plot of the marginal posterior density of (log(α/β), log(α+β)) for
the rat tumor example. Contour lines are at 0.05, 0.15,...,0.95 times the density at the mode.

We could avoid the mathematical effort of checking the integrability of the posterior
density if we were to use a proper hyperprior distribution. Another approach would be
tentatively to use a flat hyperprior density, such as p(α/(α+β), α+β) ∝ 1, or even p(α, β) ∝ 1,
and then compute the contours and simulations from the posterior density (as detailed
below). The result would clearly show the posterior contours drifting off toward infinity,
indicating that the posterior density is not integrable in that limit. The prior distribution
would then have to be altered to obtain an integrable posterior density.

Incidentally, setting the prior distribution for (log(α/β), log(α+β)) to uniform in a vague
but finite range, such as [−10^10, 10^10] × [−10^10, 10^10], would not be an acceptable solution
for this problem, as almost all the posterior mass in this case would be in the range of α
and β near ‘infinity,’ which corresponds to a Beta(α, β) distribution with a variance of zero,
meaning that all the θ_j parameters would be essentially equal in the posterior distribution.
When the likelihood is not integrable, setting a faraway finite cutoff to a uniform prior
density does not necessarily eliminate the problem.
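
As a check on the change of variables behind (5.9) and (5.10), the two Jacobians can be worked out directly. The shorthand m, s, u, v below is introduced only for this derivation and does not appear in the text.

```latex
% Step 1: a uniform density on (m, s) = (alpha/(alpha+beta), (alpha+beta)^{-1/2})
% transformed to the (alpha, beta) scale.
\begin{align*}
\left|\frac{\partial(m,s)}{\partial(\alpha,\beta)}\right|
  &= \left|\det\begin{pmatrix}
       \dfrac{\beta}{(\alpha+\beta)^{2}} & -\dfrac{\alpha}{(\alpha+\beta)^{2}} \\[1.5ex]
       -\tfrac{1}{2}(\alpha+\beta)^{-3/2} & -\tfrac{1}{2}(\alpha+\beta)^{-3/2}
     \end{pmatrix}\right|
   = \tfrac{1}{2}(\alpha+\beta)^{-5/2},
\\
p(\alpha,\beta) &\propto (\alpha+\beta)^{-5/2}.
\end{align*}

% Step 2: from (alpha, beta) to (u, v) = (log(alpha/beta), log(alpha+beta)),
% using alpha = e^{v} e^{u}/(1+e^{u}) and beta = e^{v}/(1+e^{u}).
\begin{align*}
\left|\frac{\partial(\alpha,\beta)}{\partial(u,v)}\right|
  &= \left|\det\begin{pmatrix}
       \dfrac{\alpha\beta}{\alpha+\beta} & \alpha \\[1.5ex]
       -\dfrac{\alpha\beta}{\alpha+\beta} & \beta
     \end{pmatrix}\right|
   = \alpha\beta,
\\
p(u,v) &\propto \alpha\beta\,(\alpha+\beta)^{-5/2}.
\end{align*}
```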


Computing the marginal posterior density of the hyperparameters. Now that we have
established a full probability model for data and parameters, we compute the marginal posterior
distribution of the hyperparameters. Figure 5.2 shows a contour plot of the unnormalized
marginal posterior density on a grid of values of (log(α/β), log(α+β)). To create the plot, we
first compute the logarithm of the density function (5.8) with prior density (5.9),
multiplying by the Jacobian to obtain the density p(log(α/β), log(α+β)|y). We set a grid in the range
(log(α/β), log(α+β)) ∈ [−2.5, −1] × [1.5, 3], which is centered near our earlier point estimate
(−1.8, 2.3) (that is, (α, β) = (1.4, 8.6)) and covers a factor of 4 in each parameter. Then, to
avoid computational overflows, we subtract the maximum value of the log density from each
point on the grid and exponentiate, yielding values of the unnormalized marginal posterior
density.
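
The subtract-the-maximum step can be illustrated on a toy example. The log-density values below are made up for illustration; with values this negative, direct exponentiation underflows to zero, and the same trick guards against overflow when the log densities are large and positive.

```python
import numpy as np

# Hypothetical log densities at three grid points (values chosen to be extreme).
log_dens = np.array([-1050.0, -1045.0, -1048.0])

naive = np.exp(log_dens)                       # every entry underflows to 0.0
stable = np.exp(log_dens - log_dens.max())     # largest entry becomes exp(0) = 1

print(naive)
print(stable)
```

Only relative values matter for an unnormalized density, so subtracting a constant on the log scale changes nothing of substance.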

The most obvious features of the contour plot are (1) the mode is not far from the 
point estimate (as we would expect), and (2) important parts of the marginal posterior 
distribution lie outside the range of the graph. 

We recompute $p(\log(\alpha/\beta), \log(\alpha+\beta)|y)$, this time in the range
$(\log(\alpha/\beta), \log(\alpha+\beta)) \in [-2.3, -1.3] \times [1, 5]$. The resulting grid, shown in Figure 5.3a, displays
essentially all of the marginal posterior distribution. Figure 5.3b displays 1000 random draws from the
numerically computed posterior distribution. The graphs show that the marginal posterior
distribution of the hyperparameters, under this transformation, is approximately symmetric
about the mode, roughly $(-1.75, 2.8)$. This corresponds to approximate values of
$(\alpha, \beta) = (2.4, 14.0)$, which differs somewhat from the crude estimate obtained earlier.

Figure 5.3 (a) Contour plot of the marginal posterior density of $(\log(\alpha/\beta), \log(\alpha+\beta))$ for the rat tumor
example. Contour lines are at 0.05, 0.15, ..., 0.95 times the density at the mode. (b) Scatterplot of
1000 draws $(\log(\alpha/\beta), \log(\alpha+\beta))$ from the numerically computed marginal posterior density.

Having computed the relative posterior density at a grid that covers the effective range
of $(\alpha, \beta)$, we normalize by approximating the distribution as a step function over the grid
and setting the total probability in the grid to 1.

We can then compute posterior moments based on the grid of $(\log(\alpha/\beta), \log(\alpha+\beta))$; for
example, $E(\alpha|y)$ is estimated by

$$\sum_{\log(\alpha/\beta),\,\log(\alpha+\beta)} \alpha \cdot p(\log(\alpha/\beta), \log(\alpha+\beta)|y).$$

From the grid in Figure 5.3, we compute $E(\alpha|y) = 2.4$ and $E(\beta|y) = 14.3$. This is close to the
estimate based on the mode of Figure 5.3a, given above, because the posterior distribution is
approximately symmetric on the scale of $(\log(\alpha/\beta), \log(\alpha+\beta))$. A more important consequence
of averaging over the grid is to account for the posterior uncertainty in $(\alpha, \beta)$, which is not
captured in the point estimate.

Sampling from the joint posterior distribution of parameters and hyperparameters. We
draw 1000 random samples from the joint posterior distribution of $(\alpha, \beta, \theta_1, \dots, \theta_J)$, as
follows.

1. Simulate 1000 draws of $(\log(\alpha/\beta), \log(\alpha+\beta))$ from their posterior distribution displayed
in Figure 5.3, using the same discrete-grid sampling procedure used to draw $(\alpha, \beta)$ for
Figure 3.3b in the bioassay example of Section 3.8.

2. For $l = 1, \dots, 1000$:

(a) Transform the $l$th draw of $(\log(\alpha/\beta), \log(\alpha+\beta))$ to the scale $(\alpha, \beta)$ to yield a draw of
the hyperparameters from their marginal posterior distribution.

(b) For each $j = 1, \dots, J$, sample $\theta_j$ from its conditional posterior distribution,
$\theta_j|\alpha, \beta, y \sim \mathrm{Beta}(\alpha + y_j, \beta + n_j - y_j)$.
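The two steps above can be sketched as follows. The grid, its probabilities, and the counts are toy values chosen only to make the example run; the function names are hypothetical; and the uniform jitter within a grid cell is our assumption about the discrete-grid procedure being referred to, included so that the draws are continuous.

```python
import math
import random

def draw_cell(xs, zs, probs, rng):
    """Inverse-cdf draw of one grid point from a normalized probability table."""
    u, cum = rng.random(), 0.0
    for i, x in enumerate(xs):
        for k, z in enumerate(zs):
            cum += probs[i][k]
            if u <= cum:
                return x, z
    return xs[-1], zs[-1]            # guards against round-off in the running sum

def sample_joint(xs, zs, probs, y, n, n_draws=1000, seed=1):
    """Draw (alpha, beta, theta_1..theta_J) jointly; assumes a uniformly spaced grid."""
    rng = random.Random(seed)
    dx, dz = xs[1] - xs[0], zs[1] - zs[0]
    draws = []
    for _ in range(n_draws):
        # Step 1: draw (log(alpha/beta), log(alpha+beta)) from the grid, with jitter
        x, z = draw_cell(xs, zs, probs, rng)
        x += rng.uniform(-dx / 2, dx / 2)
        z += rng.uniform(-dz / 2, dz / 2)
        # Step 2(a): transform back to the (alpha, beta) scale
        total = math.exp(z)
        alpha = total / (1.0 + math.exp(-x))
        beta = total - alpha
        # Step 2(b): theta_j | alpha, beta, y ~ Beta(alpha + y_j, beta + n_j - y_j)
        theta = [rng.betavariate(alpha + yj, beta + nj - yj) for yj, nj in zip(y, n)]
        draws.append((alpha, beta, theta))
    return draws

# Toy 2x2 grid with made-up probabilities and made-up counts, for illustration only
xs, zs = [-1.8, -1.7], [2.7, 2.9]
probs = [[0.2, 0.3], [0.3, 0.2]]
y, n = [0, 2, 5], [20, 19, 19]
draws = sample_joint(xs, zs, probs, y, n)
```

Posterior medians and intervals for each $\theta_j$ are then read off the corresponding column of `draws`.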


Displaying the results. Figure 5.4 shows posterior medians and 95% intervals for the $\theta_j$’s,
computed by simulation. The rates $\theta_j$ are shrunk from their sample point estimates, $y_j/n_j$,
towards the population distribution, with approximate mean 0.14; experiments with fewer
observations are shrunk more and have higher posterior variances. The results are superficially
similar to what would be obtained based on a point estimate of the hyperparameters,
which makes sense in this example, because of the fairly large number of experiments.
But key differences remain, notably that posterior variability is higher in the full Bayesian
analysis, reflecting posterior uncertainty in the hyperparameters.

Figure 5.4 Posterior medians and 95% intervals of rat tumor rates, $\theta_j$ (plotted vs. observed tumor
rates $y_j/n_j$), based on simulations from the joint posterior distribution. The 45° line corresponds
to the unpooled estimates, $\hat{\theta}_j = y_j/n_j$. The horizontal positions of the line have been jittered to
reduce overlap.


5.4 Normal model with exchangeable parameters 


We now present a full treatment of a simple hierarchical model based on the normal distribu- 
tion, in which observed data are normally distributed with a different mean for each ‘group’ 
or ‘experiment,’ with known observation variance, and a normal population distribution 
for the group means. This model is sometimes termed the one-way normal random-effects 
model with known data variance and is widely applicable, being an important special case 
of the hierarchical normal linear model, which we treat in some generality in Chapter 15. 
In this section, we present a general treatment following the computational approach of 
Section 5.3. The following section presents a detailed example; those impatient with the 
algebraic details may wish to look ahead at the example for motivation. 


The data structure 


Consider $J$ independent experiments, with experiment $j$ estimating the parameter $\theta_j$ from
$n_j$ independent normally distributed data points, $y_{ij}$, each with known error variance $\sigma^2$;
that is,


$$y_{ij}|\theta_j \sim N(\theta_j, \sigma^2), \quad \text{for } i = 1, \dots, n_j;\ j = 1, \dots, J. \qquad (5.11)$$


Using standard notation from the analysis of variance, we label the sample mean of each
group $j$ as

$$\bar{y}_{.j} = \frac{1}{n_j}\sum_{i=1}^{n_j} y_{ij},$$

with sampling variance

$$\sigma_j^2 = \sigma^2/n_j.$$

We can then write the likelihood for each $\theta_j$ using the sufficient statistics, $\bar{y}_{.j}$:

$$\bar{y}_{.j}|\theta_j \sim N(\theta_j, \sigma_j^2), \qquad (5.12)$$


a notation that will prove useful later because of the flexibility in allowing a separate
variance $\sigma_j^2$ for the mean of each group $j$. For the rest of this chapter, all expressions will
be implicitly conditional on the known values $\sigma_j^2$. The problem of estimating a set of means
with unknown variances will require some additional computational methods, presented in
Sections 11.6 and 13.6. Although rarely strictly true, the assumption of known variances
at the sampling level of the model is often an adequate approximation.

The treatment of the model provided in this section is also appropriate for situations
in which the variances differ for reasons other than the number of data points in the
experiment. In fact, the likelihood (5.12) can appear in much more general contexts than
that stated here. For example, if the group sizes $n_j$ are large enough, then the means $\bar{y}_{.j}$
are approximately normally distributed, given $\theta_j$, even when the data $y_{ij}$ are not. Other
applications where the actual likelihood is well approximated by (5.12) appear in the next
two sections.


Constructing a prior distribution from pragmatic considerations 


Rather than considering immediately the problem of specifying a prior distribution for the
parameter vector $\theta = (\theta_1, \dots, \theta_J)$, let us consider what sorts of posterior estimates might
be reasonable for $\theta$, given data $(y_{ij})$. A simple natural approach is to estimate $\theta_j$ by $\bar{y}_{.j}$, the
average outcome in experiment $j$. But what if, for example, there are $J = 20$ experiments
with only $n_j = 2$ observations per experimental group, and the groups are 20 pairs of
assays taken from the same strain of rat, under essentially identical conditions? The two
observations per group do not permit accurate estimates. Since the 20 groups are from the
same strain of rat, we might now prefer to estimate each $\theta_j$ by the pooled estimate,


$$\bar{y}_{..} = \frac{\sum_{j=1}^{J} \frac{1}{\sigma_j^2}\,\bar{y}_{.j}}{\sum_{j=1}^{J} \frac{1}{\sigma_j^2}}. \qquad (5.13)$$

To decide which estimate to use, a traditional approach from classical statistics is to
perform an analysis of variance F test for differences among means: if the $J$ group means
appear significantly variable, choose separate sample means, and if the variance between
the group means is not significantly greater than what could be explained by individual
variability within groups, use $\bar{y}_{..}$. The theoretical analysis of variance table is as follows,
where $\tau^2$ is the variance of $\theta_1, \dots, \theta_J$. For simplicity, we present the analysis of variance for
a balanced design in which $n_j = n$ and $\sigma_j^2 = \sigma^2/n$ for all $j$.


                  df          SS                                                  MS                E(MS | $\sigma^2$, $\tau$)
Between groups    $J-1$       $\sum_i \sum_j (\bar{y}_{.j} - \bar{y}_{..})^2$     SS/$(J-1)$        $n\tau^2 + \sigma^2$
Within groups     $J(n-1)$    $\sum_i \sum_j (y_{ij} - \bar{y}_{.j})^2$           SS/$(J(n-1))$     $\sigma^2$
Total             $Jn-1$      $\sum_i \sum_j (y_{ij} - \bar{y}_{..})^2$           SS/$(Jn-1)$


In the classical random-effects analysis of variance, one computes the sum of squares (SS)
and the mean square (MS) columns of the table and uses the ‘between’ and ‘within’ mean
squares to estimate $\tau$. If the ratio of between to within mean squares is significantly greater
than 1, then the analysis of variance suggests separate estimates, $\hat{\theta}_j = \bar{y}_{.j}$ for each $j$. If
the ratio of mean squares is not ‘statistically significant,’ then the F test cannot ‘reject the
hypothesis’ that $\tau = 0$, and pooling is reasonable: $\hat{\theta}_j = \bar{y}_{..}$, for all $j$. We discuss Bayesian
analysis of variance in Section 15.6 in the context of hierarchical regression models.

But we are not forced to choose between complete pooling and none at all. An alternative
is to use a weighted combination:

$$\hat{\theta}_j = \lambda_j \bar{y}_{.j} + (1 - \lambda_j)\,\bar{y}_{..},$$

where $\lambda_j$ is between 0 and 1.

What kind of prior models produce these various posterior estimates?

1. The unpooled estimate $\hat{\theta}_j = \bar{y}_{.j}$ is the posterior mean if the $J$ values $\theta_j$ have independent
uniform prior densities on $(-\infty, \infty)$.
2. The pooled estimate $\hat{\theta} = \bar{y}_{..}$ is the posterior mean if the $J$ values $\theta_j$ are restricted to be
equal, with a uniform prior density on the common $\theta$.
3. The weighted combination is the posterior mean if the $J$ values $\theta_j$ have independent and
identically distributed normal prior densities.

All three of these options are exchangeable in the $\theta_j$’s, and options 1 and 2 are special cases
of option 3. No pooling corresponds to $\lambda_j = 1$ for all $j$ and an infinite prior variance for
the $\theta_j$’s, and complete pooling corresponds to $\lambda_j = 0$ for all $j$ and a zero prior variance for
the $\theta_j$’s.


The hierarchical model 


For the convenience of conjugacy (more accurately, partial conjugacy), we assume that the
parameters $\theta_j$ are drawn from a normal distribution with hyperparameters $(\mu, \tau)$:


$$p(\theta_1, \dots, \theta_J|\mu, \tau) = \prod_{j=1}^{J} N(\theta_j|\mu, \tau^2), \qquad (5.14)$$

$$p(\theta_1, \dots, \theta_J) = \int \prod_{j=1}^{J} \left[N(\theta_j|\mu, \tau^2)\right] p(\mu, \tau)\, d(\mu, \tau).$$


That is, the $\theta_j$’s are conditionally independent given $(\mu, \tau)$. The hierarchical model also
permits the interpretation of the $\theta_j$’s as a random sample from a shared population
distribution, as illustrated in Figure 5.1 for the rat tumors.

We assign a noninformative uniform hyperprior distribution to $\mu$, given $\tau$:


$$p(\mu, \tau) = p(\mu|\tau)\,p(\tau) \propto p(\tau). \qquad (5.15)$$


The uniform prior density for $\mu$ is generally reasonable for this problem; because the
combined data from all $J$ experiments are generally highly informative about $\mu$, we can afford
to be vague about its prior distribution. We defer discussion of the prior distribution of
$\tau$ to later in the analysis, although relevant principles have already been discussed in the
context of the rat tumor example. As usual, we first work out the answer conditional on
the hyperparameters and then consider their prior and posterior distributions.


The joint posterior distribution 


Combining the sampling model for the observable $y_{ij}$’s and the prior distribution yields
the joint posterior distribution of all the parameters and hyperparameters, which we can
express in terms of the sufficient statistics, $\bar{y}_{.j}$:


$$p(\theta, \mu, \tau|y) \propto p(\mu, \tau)\,p(\theta|\mu, \tau)\,p(y|\theta)
\propto p(\mu, \tau) \prod_{j=1}^{J} N(\theta_j|\mu, \tau^2) \prod_{j=1}^{J} N(\bar{y}_{.j}|\theta_j, \sigma_j^2), \qquad (5.16)$$



where we can ignore factors that depend only on $y$ and the parameters $\sigma_j$, which are assumed
known for this analysis.


The conditional posterior distribution of the normal means, given the hyperparameters 


As in the general hierarchical structure, the parameters $\theta_j$ are independent in the prior
distribution (given $\mu$ and $\tau$) and appear in different factors in the likelihood (5.11); thus,
the conditional posterior distribution $p(\theta|\mu, \tau, y)$ factors into $J$ components.

Conditional on the hyperparameters, we simply have $J$ independent unknown normal
means, given normal prior distributions, so we can use the methods of Section 2.5
independently on each $\theta_j$. The conditional posterior distributions for the $\theta_j$’s are independent,
and

$$\theta_j|\mu, \tau, y \sim N(\hat{\theta}_j, V_j),$$

where

$$\hat{\theta}_j = \frac{\frac{1}{\sigma_j^2}\,\bar{y}_{.j} + \frac{1}{\tau^2}\,\mu}{\frac{1}{\sigma_j^2} + \frac{1}{\tau^2}}
\quad \text{and} \quad
V_j = \frac{1}{\frac{1}{\sigma_j^2} + \frac{1}{\tau^2}}. \qquad (5.17)$$


The posterior mean is a precision-weighted average of the prior population mean and the
sample mean of the $j$th group; these expressions for $\hat{\theta}_j$ and $V_j$ are functions of $\mu$ and $\tau$ as
well as the data. The conditional posterior density for each $\theta_j$ given $\mu$, $\tau$ is proper.
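As a small numerical check on (5.17), the following sketch computes the conditional posterior means and variances for given hyperparameters. The function name is hypothetical, the inputs are the rounded estimates and standard errors from Table 5.2 of the next section, and the values $\mu = 8$, $\tau = 10$ are picked purely for illustration.

```python
def conditional_posterior(ybar, sigma, mu, tau):
    """Return (theta_hat, V) lists from (5.17); requires tau > 0."""
    theta_hat, V = [], []
    for yj, sj in zip(ybar, sigma):
        prec_data = 1.0 / sj**2       # precision of the group mean
        prec_prior = 1.0 / tau**2     # precision of the population distribution
        Vj = 1.0 / (prec_data + prec_prior)
        theta_hat.append(Vj * (prec_data * yj + prec_prior * mu))
        V.append(Vj)
    return theta_hat, V

# Rounded values from Table 5.2; mu and tau are illustrative, not estimates
y_schools = [28, 8, -3, 7, -1, 1, 18, 12]
sigma_schools = [15, 10, 16, 11, 9, 11, 10, 18]
theta_hat, V = conditional_posterior(y_schools, sigma_schools, mu=8.0, tau=10.0)
```

Each posterior mean lies between the group estimate and $\mu$, and groups with larger $\sigma_j$ are pulled more strongly toward $\mu$.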


The marginal posterior distribution of the hyperparameters 


The solution so far is only partial because it depends on the unknown $\mu$ and $\tau$. The next step
in our approach is a full Bayesian treatment for the hyperparameters. Section 5.3 mentions
integration or analytic computation as two approaches for obtaining $p(\mu, \tau|y)$ from the joint
posterior density $p(\theta, \mu, \tau|y)$. For the hierarchical normal model, we can simply consider
the information supplied by the data about the hyperparameters directly:


$$p(\mu, \tau|y) \propto p(\mu, \tau)\,p(y|\mu, \tau).$$


For many problems, this decomposition is no help, because the ‘marginal likelihood’ factor,
$p(y|\mu, \tau)$, cannot generally be written in closed form. For the normal distribution, however,
the marginal likelihood has a particularly simple form. The marginal distributions of the
group means $\bar{y}_{.j}$, averaging over $\theta_j$, are independent (but not identically distributed) normal:


$$\bar{y}_{.j}|\mu, \tau \sim N(\mu, \sigma_j^2 + \tau^2).$$


Thus we can write the marginal posterior density as

$$p(\mu, \tau|y) \propto p(\mu, \tau) \prod_{j=1}^{J} N(\bar{y}_{.j}|\mu, \sigma_j^2 + \tau^2). \qquad (5.18)$$


Posterior distribution of $\mu$ given $\tau$. We could use (5.18) to compute directly the posterior
distribution $p(\mu, \tau|y)$ as a function of two variables and proceed as in the rat tumor example.
For the normal model, however, we can further simplify by integrating over $\mu$, leaving a
simple univariate numerical computation of $p(\tau|y)$. We factor the marginal posterior density
of the hyperparameters as we did the prior density (5.15):


$$p(\mu, \tau|y) = p(\mu|\tau, y)\,p(\tau|y). \qquad (5.19)$$


The first factor on the right side of (5.19) is just the posterior distribution of $\mu$ if $\tau$ were
known. From inspection of (5.18) with $\tau$ assumed known, and with a uniform conditional
prior density $p(\mu|\tau)$, the log posterior distribution is found to be quadratic in $\mu$; thus,
$p(\mu|\tau, y)$ must be normal. The mean and variance of this distribution can be obtained
immediately by considering the group means $\bar{y}_{.j}$ as $J$ independent estimates of $\mu$ with
variances $(\sigma_j^2 + \tau^2)$. Combining the data with the uniform prior density $p(\mu|\tau)$ yields


$$\mu|\tau, y \sim N(\hat{\mu}, V_\mu),$$


where $\hat{\mu}$ is the precision-weighted average of the $\bar{y}_{.j}$-values, and $V_\mu^{-1}$ is the total precision:


$$\hat{\mu} = \frac{\sum_{j=1}^{J} \frac{1}{\sigma_j^2 + \tau^2}\,\bar{y}_{.j}}{\sum_{j=1}^{J} \frac{1}{\sigma_j^2 + \tau^2}}
\quad \text{and} \quad
V_\mu^{-1} = \sum_{j=1}^{J} \frac{1}{\sigma_j^2 + \tau^2}. \qquad (5.20)$$


The result is a proper posterior density for $\mu$, given $\tau$.
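A minimal sketch of (5.20), using a hypothetical helper name and the rounded eight-schools values of Table 5.2 from the next section as example input. Setting $\tau = 0$ recovers the pooled estimate (5.13).

```python
def mu_posterior(ybar, sigma, tau):
    """Return (mu_hat, V_mu) from (5.20) for a given tau >= 0."""
    prec = [1.0 / (s**2 + tau**2) for s in sigma]   # marginal precisions of the group means
    total_prec = sum(prec)
    mu_hat = sum(p * yj for p, yj in zip(prec, ybar)) / total_prec
    return mu_hat, 1.0 / total_prec

# Rounded estimates and standard errors from Table 5.2
y_schools = [28, 8, -3, 7, -1, 1, 18, 12]
sigma_schools = [15, 10, 16, 11, 9, 11, 10, 18]

mu0, V0 = mu_posterior(y_schools, sigma_schools, tau=0.0)      # complete pooling
mu10, V10 = mu_posterior(y_schools, sigma_schools, tau=10.0)   # an illustrative tau
```

With these rounded inputs the $\tau = 0$ case gives roughly 7.7 for the mean and 16.6 for the variance, matching the pooled analysis reported in Section 5.5; larger $\tau$ flattens the weights and increases $V_\mu$.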


Posterior distribution of $\tau$. We can now obtain the posterior distribution of $\tau$ analytically
from (5.19) and substitution of (5.18) and (5.20) for the numerator and denominator,
respectively:


$$p(\tau|y) = \frac{p(\mu, \tau|y)}{p(\mu|\tau, y)}
\propto \frac{p(\tau) \prod_{j=1}^{J} N(\bar{y}_{.j}|\mu, \sigma_j^2 + \tau^2)}{N(\mu|\hat{\mu}, V_\mu)}.$$


This identity must hold for any value of $\mu$ (in other words, all the factors of $\mu$ must cancel
when the expression is simplified); in particular, it holds if we set $\mu$ to $\hat{\mu}$, which makes
evaluation of the expression simple:

$$p(\tau|y) \propto \frac{p(\tau) \prod_{j=1}^{J} N(\bar{y}_{.j}|\hat{\mu}, \sigma_j^2 + \tau^2)}{N(\hat{\mu}|\hat{\mu}, V_\mu)}
\propto p(\tau)\, V_\mu^{1/2} \prod_{j=1}^{J} (\sigma_j^2 + \tau^2)^{-1/2} \exp\left(-\frac{(\bar{y}_{.j} - \hat{\mu})^2}{2(\sigma_j^2 + \tau^2)}\right), \qquad (5.21)$$


with $\hat{\mu}$ and $V_\mu$ defined in (5.20). Both expressions are functions of $\tau$, which means that
$p(\tau|y)$ is a complicated function of $\tau$.
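Although (5.21) is awkward analytically, it is easy to evaluate numerically. The sketch below computes it on the log scale over a grid of $\tau$ values; the function name, the grid from 0 to 30, and the flat default prior are our choices for illustration, and the inputs are the rounded values of Table 5.2 from the next section.

```python
import math

def log_p_tau(tau, ybar, sigma, log_prior=lambda t: 0.0):
    """Unnormalized log p(tau | y) from (5.21); the default prior is p(tau) proportional to 1."""
    prec = [1.0 / (s**2 + tau**2) for s in sigma]
    V_mu = 1.0 / sum(prec)
    mu_hat = V_mu * sum(p * yj for p, yj in zip(prec, ybar))
    lp = log_prior(tau) + 0.5 * math.log(V_mu)
    for p, yj in zip(prec, ybar):
        # (sigma_j^2 + tau^2)^(-1/2) * exp(-(ybar_j - mu_hat)^2 / (2 (sigma_j^2 + tau^2)))
        lp += 0.5 * math.log(p) - 0.5 * p * (yj - mu_hat)**2
    return lp

y_schools = [28, 8, -3, 7, -1, 1, 18, 12]
sigma_schools = [15, 10, 16, 11, 9, 11, 10, 18]

taus = [i / 10 for i in range(301)]                       # 0.0, 0.1, ..., 30.0
logp = [log_p_tau(t, y_schools, sigma_schools) for t in taus]
peak = max(logp)
dens = [math.exp(v - peak) for v in logp]                 # relative density, maximum 1
```

For these inputs the relative density is highest at $\tau = 0$ and falls to well under one half by $\tau = 10$, in line with the description of Figure 5.5 in the next section.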


Prior distribution for $\tau$. To complete our analysis, we must assign a prior distribution to
$\tau$. For convenience, we use a diffuse noninformative prior density for $\tau$ and hence must
examine the resulting posterior density to ensure it has a finite integral. For our illustrative
analysis, we use the uniform prior distribution, $p(\tau) \propto 1$. We leave it as an exercise to show
mathematically that the uniform prior density for $\tau$ yields a proper posterior density and
that, in contrast, the seemingly reasonable ‘noninformative’ prior distribution for a variance
component, $p(\log \tau) \propto 1$, yields an improper posterior distribution for $\tau$. Alternatively, in
applications it involves little extra effort to determine a ‘best guess’ and an upper bound
for the population variance $\tau^2$, and a reasonable prior distribution can then be constructed
from the scaled inverse-$\chi^2$ family (the natural choice for variance parameters), matching the
‘best guess’ to the mean of the scaled inverse-$\chi^2$ density and the upper bound to an upper
percentile such as the 99th. Once an initial analysis is performed using the noninformative
‘uniform’ prior density, a sensitivity analysis with a more realistic prior distribution is often
desirable.


Computation


For this model, computation of the posterior distribution of $\theta$ is most conveniently performed
via simulation, following the factorization used above:


$$p(\theta, \mu, \tau|y) = p(\tau|y)\,p(\mu|\tau, y)\,p(\theta|\mu, \tau, y).$$


The first step, simulating $\tau$, is easily performed numerically using the inverse cdf method
(see Section 1.9) on a grid of uniformly spaced values of $\tau$, with $p(\tau|y)$ computed from
(5.21). The second and third steps, simulating $\mu$ and then $\theta$, can both be done easily by
sampling from normal distributions, first (5.20) to obtain $\mu$ and then (5.17) to obtain the
$\theta_j$’s independently.
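The three simulation steps can be sketched as below. This is one possible implementation under stated assumptions rather than a definitive one: the function names are hypothetical, the prior is $p(\tau) \propto 1$, the grid uses cell midpoints on $(0, 30)$ so that $\tau > 0$ in (5.17), and the example data are the rounded values of Table 5.2 from the next section.

```python
import bisect
import math
import random

def mu_posterior(ybar, sigma, tau):
    """(mu_hat, V_mu) from (5.20)."""
    prec = [1.0 / (s**2 + tau**2) for s in sigma]
    V_mu = 1.0 / sum(prec)
    return V_mu * sum(p * yj for p, yj in zip(prec, ybar)), V_mu

def log_p_tau(tau, ybar, sigma):
    """Unnormalized log p(tau | y) from (5.21) with a uniform prior on tau."""
    mu_hat, V_mu = mu_posterior(ybar, sigma, tau)
    lp = 0.5 * math.log(V_mu)
    for yj, sj in zip(ybar, sigma):
        v = sj**2 + tau**2
        lp += -0.5 * math.log(v) - 0.5 * (yj - mu_hat)**2 / v
    return lp

def simulate(ybar, sigma, n_draws=2000, tau_max=30.0, steps=600, seed=2):
    """Draw (tau, mu, theta) from p(tau|y) p(mu|tau,y) p(theta|mu,tau,y)."""
    rng = random.Random(seed)
    width = tau_max / steps
    taus = [(i + 0.5) * width for i in range(steps)]      # uniformly spaced midpoints
    logp = [log_p_tau(t, ybar, sigma) for t in taus]
    peak = max(logp)
    weights = [math.exp(v - peak) for v in logp]
    total, running, cdf = sum(weights), 0.0, []
    for w in weights:
        running += w / total
        cdf.append(running)
    draws = []
    for _ in range(n_draws):
        # Step 1: inverse-cdf draw of tau on the grid
        idx = min(bisect.bisect_left(cdf, rng.random()), steps - 1)
        tau = taus[idx]
        # Step 2: mu | tau, y from (5.20)
        mu_hat, V_mu = mu_posterior(ybar, sigma, tau)
        mu = rng.gauss(mu_hat, math.sqrt(V_mu))
        # Step 3: theta_j | mu, tau, y from (5.17), independently over j
        theta = []
        for yj, sj in zip(ybar, sigma):
            Vj = 1.0 / (1.0 / sj**2 + 1.0 / tau**2)
            theta.append(rng.gauss(Vj * (yj / sj**2 + mu / tau**2), math.sqrt(Vj)))
        draws.append((tau, mu, theta))
    return draws

y_schools = [28, 8, -3, 7, -1, 1, 18, 12]
sigma_schools = [15, 10, 16, 11, 9, 11, 10, 18]
draws = simulate(y_schools, sigma_schools)
```

Quantiles of any function of the parameters, such as the largest $\theta_j$, can then be computed directly from `draws`.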


Posterior predictive distributions 


Sampling from the posterior predictive distribution of new data, either from a current or
new batch, is straightforward given draws from the posterior distribution of the parameters.
We consider two scenarios: (1) future data $\tilde{y}$ from the current set of batches, with means
$\theta = (\theta_1, \dots, \theta_J)$, and (2) future data $\tilde{y}$ from $\tilde{J}$ future batches, with means
$\tilde{\theta} = (\tilde{\theta}_1, \dots, \tilde{\theta}_{\tilde{J}})$.
In the latter case, we must also specify the $\tilde{J}$ individual sample sizes $\tilde{n}_j$ for the future
batches.

To obtain a draw from the posterior predictive distribution of new data $\tilde{y}$ from the
current batch of parameters, $\theta$, first obtain a draw from $p(\theta, \mu, \tau|y)$ and then draw the
predictive data $\tilde{y}$ from (5.11).

To obtain posterior predictive simulations of new data $\tilde{y}$ for $\tilde{J}$ new groups, perform the
following three steps: first, draw $(\mu, \tau)$ from their posterior distribution; second, draw $\tilde{J}$
new parameters $\tilde{\theta} = (\tilde{\theta}_1, \dots, \tilde{\theta}_{\tilde{J}})$ from the population distribution $p(\tilde{\theta}_j|\mu, \tau)$, which is the
population, or prior, distribution for $\theta$ given the hyperparameters (equation (5.14)); and
third, draw $\tilde{y}$ given $\tilde{\theta}$ from the data distribution (5.11).
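The second scenario can be sketched as follows, taking posterior draws of $(\mu, \tau)$ as given. For brevity the sketch simulates new group means via (5.12) rather than individual observations via (5.11), so `sigma_new` holds the assumed sampling standard deviation of each new group's mean; the function name and the three hyperparameter draws are hypothetical.

```python
import random

def predict_new_groups(hyper_draws, sigma_new, seed=3):
    """For each posterior draw (mu, tau), simulate new group effects and new group means."""
    rng = random.Random(seed)
    sims = []
    for mu, tau in hyper_draws:
        # Second step: new parameters from the population distribution (5.14)
        theta_new = [rng.gauss(mu, tau) for _ in sigma_new]
        # Third step: new data given the new parameters (group-mean form, as in (5.12))
        y_new = [rng.gauss(t, s) for t, s in zip(theta_new, sigma_new)]
        sims.append((theta_new, y_new))
    return sims

# Made-up posterior draws of (mu, tau) and made-up standard errors for two new groups
hyper_draws = [(8.0, 5.0), (7.0, 0.5), (10.0, 12.0)]
sims = predict_new_groups(hyper_draws, sigma_new=[10.0, 15.0])

# Degenerate check: with tau = 0 and no sampling noise, every value equals mu
degenerate = predict_new_groups([(7.0, 0.0)], sigma_new=[0.0, 0.0])
```

For the first scenario one would instead reuse the simulated $\theta_j$ for the existing groups and apply only the final step.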


Difficulty with a natural non-Bayesian estimate of the hyperparameters 


To see some advantages of our fully Bayesian approach, we compare it to an approximate
method that is sometimes used based on a point estimate of $\mu$ and $\tau$ from the data. Unbiased
point estimates, derived from the analysis of variance presented earlier, are


$$\hat{\mu} = \bar{y}_{..}, \qquad \hat{\tau}^2 = (MS_B - MS_W)/n. \qquad (5.22)$$


The terms $MS_B$ and $MS_W$ are the ‘between’ and ‘within’ mean squares, respectively, from
the analysis of variance. In this alternative approach, inference for $\theta_1, \dots, \theta_J$ is based on
the conditional posterior distribution, $p(\theta|\hat{\mu}, \hat{\tau})$, given the point estimates.

As we saw in the rat tumor example of the previous section, the main problem with
substituting point estimates for the hyperparameters is that it ignores our real uncertainty
about them. The resulting inference for $\theta$ cannot be interpreted as a Bayesian posterior
summary. In addition, the estimate $\hat{\tau}^2$ in (5.22) has the flaw that it can be negative! The
problem of a negative estimate for a variance component can be avoided by setting $\hat{\tau}^2$ to
zero in the case that $MS_W$ exceeds $MS_B$, but this creates new issues. Estimating $\tau^2 = 0$
whenever $MS_W > MS_B$ seems too strong a claim: if $MS_W > MS_B$, then the sample size is
too small for $\tau^2$ to be distinguished from zero, but this is not the same as saying we know
that $\tau^2 = 0$. The latter claim, made implicitly by the point estimate, implies that all the
group means $\theta_j$ are absolutely identical, which leads to scientifically indefensible claims, as
we shall see in the example in the next section. It is possible to construct a point estimate
of $(\mu, \tau)$ to avoid this particular difficulty, but it would still have the problem, common to
all point estimates, of ignoring uncertainty.
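The negative-estimate problem is easy to reproduce. The sketch below computes the point estimates (5.22) for a balanced design; the function name and the tiny data set (three groups of two observations whose group means happen to coincide) are made up for illustration.

```python
def anova_point_estimates(groups):
    """Return (mu_hat, tau2_hat, MS_B, MS_W) for a balanced one-way layout."""
    J, n = len(groups), len(groups[0])
    means = [sum(g) / n for g in groups]
    grand = sum(means) / J
    ss_between = n * sum((m - grand)**2 for m in means)
    ss_within = sum((v - m)**2 for g, m in zip(groups, means) for v in g)
    ms_b = ss_between / (J - 1)
    ms_w = ss_within / (J * (n - 1))
    return grand, (ms_b - ms_w) / n, ms_b, ms_w

# Hypothetical data: large within-group spread, identical group means
groups = [[1.0, 9.0], [2.0, 8.0], [0.0, 10.0]]
mu_hat, tau2_hat, ms_b, ms_w = anova_point_estimates(groups)
tau2_truncated = max(tau2_hat, 0.0)    # the ad hoc fix discussed above
```

Here $MS_B = 0$ and $MS_W = 100/3$, so $\hat{\tau}^2 = -100/6$; truncating at zero asserts that the three group means are exactly equal, a much stronger claim than six observations can support.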


5.5 Example: parallel experiments in eight schools 


We illustrate the hierarchical normal model with a problem in which the Bayesian analysis 
gives conclusions that differ in important respects from other methods. 

A study was performed for the Educational Testing Service to analyze the effects of 
special coaching programs on test scores. Separate randomized experiments were performed 
to estimate the effects of coaching programs for the SAT-V (Scholastic Aptitude Test- 
Verbal) in each of eight high schools. The outcome variable in each study was the score on 
a special administration of the SAT-V, a standardized multiple choice test administered by 
the Educational Testing Service and used to help colleges make admissions decisions; the 
scores can vary between 200 and 800, with mean about 500 and standard deviation about 
100. The SAT examinations are designed to be resistant to short-term efforts directed 
specifically toward improving performance on the test; instead they are designed to reflect 
knowledge acquired and abilities developed over many years of education. Nevertheless, 
each of the eight schools in this study considered its short-term coaching program to be 
successful at increasing SAT scores. Also, there was no prior reason to believe that any of 
the eight programs was more effective than any other or that some were more similar in 
effect to each other than to any other. 

The results of the experiments are summarized in Table 5.2. All students in the ex- 
periments had already taken the PSAT (Preliminary SAT), and allowance was made for 
differences in the PSAT-M (Mathematics) and PSAT-V test scores between coached and 
uncoached students. In particular, in each school the estimated coaching effect and its 
standard error were obtained by an analysis of covariance adjustment (that is, a linear 
regression was performed of SAT-V on treatment group, using PSAT-M and PSAT-V as 
control variables) appropriate for a completely randomized experiment. A separate regression
was estimated for each school. Although not simple sample means (because of the
covariance adjustments), the estimated coaching effects, which we label $y_j$, and their sampling
variances, $\sigma_j^2$, play the same role in our model as $\bar{y}_{.j}$ and $\sigma_j^2$ in the previous section.
The estimates $y_j$ are obtained by independent experiments and have approximately normal
sampling distributions with sampling variances that are known, for all practical purposes, 
because the sample sizes in all of the eight experiments were relatively large, over thirty 
students in each school (recall the discussion of data reduction in Section 4.1). Incidentally, 
an increase of eight points on the SAT-V corresponds to about one more test item correct. 


Inferences based on nonhierarchical models and their problems 


Before fitting the hierarchical Bayesian model, we first consider two simpler nonhierarchical 
methods—estimating the effects from the eight experiments independently, and complete 
pooling—and discuss why neither of these approaches is adequate for this example. 


Separate estimates. A cursory examination of Table 5.2 may at first suggest that some 
coaching programs have moderate effects (in the range 18-28 points), most have small 
effects (0-12 points), and two have small negative effects; however, when we take note 
of the standard errors of these estimated effects, we see that it is difficult statistically 
to distinguish between any of the experiments. For example, treating each experiment 
separately and applying the simple normal analysis in each yields 95% posterior intervals 
that all overlap substantially. 


          Estimated        Standard error
          treatment        of effect
School    effect, $y_j$    estimate, $\sigma_j$
A             28               15
B              8               10
C             −3               16
D              7               11
E             −1                9
F              1               11
G             18               10
H             12               18

Table 5.2 Observed effects of special preparation on SAT-V scores in eight randomized experiments.
Estimates are based on separate analyses for the eight experiments.


A pooled estimate. The general overlap in the posterior intervals based on independent
analyses suggests that all experiments might be estimating the same quantity. Under the
hypothesis that all experiments have the same effect and produce independent estimates
of this common effect, we could treat the data in Table 5.2 as eight normally distributed
observations with known variances. With a noninformative prior distribution, the posterior
mean for the common coaching effect in the schools is $\bar{y}_{..}$, as defined in equation (5.13) with
$y_j$ in place of $\bar{y}_{.j}$. This pooled estimate is 7.7, and the posterior variance is
$\left(\sum_{j=1}^{8} 1/\sigma_j^2\right)^{-1} = 16.6$ because the eight experiments are independent. Thus, we would
estimate the common effect to be 7.7 points with standard error equal to $\sqrt{16.6} = 4.1$, which
would lead to the 95% posterior interval [−0.5, 15.9], or approximately [8 ± 8]. Supporting this
analysis, the classical test of the hypothesis that all $\theta_j$’s are estimating the same quantity
yields a $\chi^2$ statistic less than its degrees of freedom (seven, in this case):
$\sum_{j=1}^{8} (y_j - \bar{y}_{..})^2/\sigma_j^2 = 4.6$. To
put it another way, the estimate $\hat{\tau}^2$ from (5.22) is negative.
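These pooled-analysis numbers can be checked from the rounded entries of Table 5.2. The variable names below are ours. With rounded inputs the $\chi^2$ statistic comes out near 4.7 rather than 4.6; we assume the small gap reflects unrounded inputs behind the figure in the text, but that explanation is a guess.

```python
import math

y = [28, 8, -3, 7, -1, 1, 18, 12]            # estimated effects, Table 5.2
sigma = [15, 10, 16, 11, 9, 11, 10, 18]      # standard errors, Table 5.2

w = [1.0 / s**2 for s in sigma]              # precisions 1/sigma_j^2
pooled = sum(wj * yj for wj, yj in zip(w, y)) / sum(w)    # equation (5.13)
pooled_var = 1.0 / sum(w)
pooled_se = math.sqrt(pooled_var)

# Classical chi-squared statistic for homogeneity, on J - 1 = 7 degrees of freedom
chi2 = sum(wj * (yj - pooled)**2 for wj, yj in zip(w, y))
```

A statistic below its degrees of freedom means the observed spread among the eight estimates is smaller than sampling error alone would be expected to produce.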

Would it be possible to have one school’s observed effect be 28 just by chance, if the 
coaching effects in all eight schools were really the same? To get a feeling for the natural 
variation that we would expect across eight studies if this assumption were true, suppose 
the estimated treatment effects are eight independent draws from a normal distribution 
with mean 8 points and standard deviation 13 points (the square root of the mean of the
eight variances $\sigma_j^2$). Then, based on the expected values of normal order statistics, we
would expect the largest observed value of yj to be about 26 points and the others, in 
diminishing order, to be about 19, 14, 10, 6, 2, —3, and —9 points. These expected effect 
sizes are consistent with the set of observed effect sizes in Table 5.2. Thus, it would appear 
imprudent to believe that school A really has an effect as large as 28 points. 
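The expected order statistics quoted above can be approximated by simulation. The following sketch (hypothetical function name, arbitrary seed and number of replications) averages the sorted values of eight independent draws from a normal distribution with mean 8 and standard deviation 13.

```python
import random

def expected_order_stats(mean, sd, k=8, n_sims=20000, seed=4):
    """Monte Carlo estimate of the expected order statistics, largest first."""
    rng = random.Random(seed)
    totals = [0.0] * k
    for _ in range(n_sims):
        sample = sorted((rng.gauss(mean, sd) for _ in range(k)), reverse=True)
        for i, v in enumerate(sample):
            totals[i] += v
    return [t / n_sims for t in totals]

expected = expected_order_stats(8.0, 13.0)
```

The first entry of `expected` should come out near 26 points; small differences from the rounded values quoted in the text are to be expected.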


Difficulties with the separate and pooled estimates. To see the problems with the two
extreme attitudes, namely the separate analyses that consider each $\theta_j$ separately and the
alternative view (a single common effect) that leads to the pooled estimate, consider $\theta_1$, the effect in
school A. The effect in school A is estimated as 28.4 with a standard error of 14.9 under
the separate analysis, versus a pooled estimate of 7.7 with a standard error of 4.1 under
the common-effect model. The separate analyses of the eight schools imply the following
posterior statement: ‘the probability is 1/2 that the true effect in A is more than 28.4,’ a
doubtful statement, considering the results for the other seven schools. On the other hand,
the pooled model implies the following statement: ‘the probability is 1/2 that the true effect
in A is less than 7.7,’ which, despite the non-significant $\chi^2$ test, seems an inaccurate
summary of our knowledge. The pooled model also implies the statement: ‘the probability is 1/2
that the true effect in A is less than the true effect in C,’ which also is difficult to justify
given the data in Table 5.2. As in the theoretical discussion of the previous section, neither
estimate is fully satisfactory, and we would like a compromise that combines information
from all eight experiments without assuming all the $\theta_j$’s to be equal. The Bayesian analysis
under the hierarchical model provides exactly that.

Figure 5.5 Marginal posterior density, $p(\tau|y)$, for standard deviation of the population of school
effects $\theta_j$ in the educational testing example.


Posterior simulation under the hierarchical model 


Consequently, we compute the posterior distribution of $\theta_1, \dots, \theta_8$, based on the normal
model presented in Section 5.4. (More discussion of the reasonableness of applying this
model in this problem appears in Sections 6.5 and 17.4.) We draw from the posterior
distribution for the Bayesian model by simulating the random variables $\tau$, $\mu$, and $\theta$, in that
order, from their posterior distribution, as discussed at the end of the previous section. The
sampling standard deviations, $\sigma_j$, are assumed known and equal to the values in Table 5.2,
and we assume independent uniform prior densities on $\mu$ and $\tau$.


Results 


The marginal posterior density function, $p(\tau|y)$ from (5.21), is plotted in Figure 5.5. Values
of $\tau$ near zero are most plausible; zero is the most likely value, values of $\tau$ larger than 10
are less than half as likely as $\tau = 0$, and $\Pr(\tau > 25) \approx 0$. Inferences regarding the marginal
distributions of the other model parameters and the joint distribution are obtained from the
simulated values. Illustrations are provided in the discussion that follows this section. In
the normal hierarchical model, however, we learn a great deal by considering the conditional
posterior distributions given $\tau$ (and averaged over $\mu$).

The conditional posterior means $E(\theta_j|\tau, y)$ (averaging over $\mu$) are displayed as functions
of $\tau$ in Figure 5.6; the vertical axis displays the scale for the $\theta_j$’s. Comparing Figure 5.6
to Figure 5.5, which has the same scale on the horizontal axis, we see that for most of the
likely values of $\tau$, the estimated effects are relatively close together; as $\tau$ becomes larger,
corresponding to more variability among schools, the estimates become more like the raw
values in Table 5.2.

The lines in Figure 5.7 show the conditional standard deviations, $\mathrm{sd}(\theta_j|\tau, y)$, as a function
of $\tau$. As $\tau$ increases, the population distribution allows the eight effects to be more
different from each other, and hence the posterior uncertainty in each individual $\theta_j$ increases,
approaching the standard deviations in Table 5.2 in the limit of $\tau \to \infty$. (The posterior
means and standard deviations for the components $\theta_j$, given $\tau$, are computed using the
mean and variance formulas (2.7) and (2.8), averaging over $\mu$; see Exercise 5.12.)

Figure 5.6 Conditional posterior means of treatment effects, $E(\theta_j|\tau, y)$, as functions of the
between-school standard deviation $\tau$, for the educational testing example. The line for school C crosses the
lines for E and F because C has a higher measurement error (see Table 5.2) and its estimate is
therefore shrunk more strongly toward the overall mean in the Bayesian analysis.
[Vertical axis: estimated treatment effects; horizontal axis: $\tau$.]

Figure 5.7 Conditional posterior standard deviations of treatment effects, $\mathrm{sd}(\theta_j|\tau, y)$, as functions
of the between-school standard deviation $\tau$, for the educational testing example.
[Vertical axis: posterior standard deviations; horizontal axis: $\tau$.]


The general conclusion from an examination of Figures 5.5-5.7 is that an effect as large
as 28.4 points in any school is unlikely. For the likely values of $\tau$, the estimates in all
schools are substantially less than 28 points. For example, even at $\tau = 10$, the probability
that the effect in school A is less than 28 points is $\Phi[(28 - 14.5)/9.1] = 93\%$, where $\Phi$ is
the standard normal cumulative distribution function; the corresponding probabilities for
the effects being less than 28 points in the other schools are 99.5%, 99.2%, 98.5%, 99.96%,
99.8%, 97%, and 98%.
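The calculation at $\tau = 10$ can be reproduced approximately as follows. The sketch averages (5.17) over $\mu|\tau, y$ using the laws of total expectation and total variance; the function names are hypothetical, and because the inputs are the rounded values of Table 5.2, the results agree with the figures in the text only to within rounding.

```python
import math

def theta_given_tau(ybar, sigma, tau):
    """Return [(E(theta_j | tau, y), sd(theta_j | tau, y))], averaging over mu; tau > 0."""
    prec = [1.0 / (s**2 + tau**2) for s in sigma]
    V_mu = 1.0 / sum(prec)                                   # from (5.20)
    mu_hat = V_mu * sum(p * yj for p, yj in zip(prec, ybar))
    out = []
    for yj, sj in zip(ybar, sigma):
        Vj = 1.0 / (1.0 / sj**2 + 1.0 / tau**2)              # from (5.17)
        weight_mu = Vj / tau**2                              # weight on mu in (5.17)
        mean = Vj * yj / sj**2 + weight_mu * mu_hat          # total expectation
        sd = math.sqrt(Vj + weight_mu**2 * V_mu)             # total variance
        out.append((mean, sd))
    return out

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]

summary = theta_given_tau(y, sigma, tau=10.0)
prob_below_28 = [Phi((28 - m) / s) for m, s in summary]
```

For school A this gives a conditional mean a little above 14 and a standard deviation a little above 9, hence a probability of about 93% that the effect is below 28 points.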

Of substantial importance, we do not obtain an accurate summary of the data if we
condition on the posterior mode of $\tau$. The technique of conditioning on a modal value (for
example, the maximum likelihood estimate) of a hyperparameter such as $\tau$ is often used
in practice (at least as an approximation), but it ignores the uncertainty conveyed by the
posterior distribution of the hyperparameter. At $\tau = 0$, the inference is that all experiments
have the same size effect, 7.7 points, and the same standard error, 4.1 points. Figures 5.5-5.7
certainly suggest that this answer represents too much pulling together of the estimates
in the eight schools. The problem is especially acute in this example because the posterior
mode of $\tau$ is on the boundary of its parameter space. A joint posterior modal estimate of
$(\theta_1, \dots, \theta_J, \mu, \tau)$ suffers from even worse problems in general.


Discussion 


Table 5.3 summarizes the 200 simulated effect estimates for all eight schools. In one sense,
these results are similar to the pooled 95% interval [8 ± 8], in that the eight Bayesian 95%
intervals largely overlap and are median-centered between 5 and 10. In a second sense,

School            Posterior quantiles
          2.5%    25%   median    75%   97.5%
A          -2      7      10      16     31
B          -5      3       8      12     23
C         -11      2       7      11     19
D          -7      4       8      11     21
E          -9      1       5      10     18
F          -7      2       6      10     28
G          -1      7      10      15     26
H          -6      3       8      13     33


Table 5.3: Summary of 200 simulations of the treatment effects in the eight schools.


Figure 5.8 Histograms of two quantities of interest computed from the 200 simulation draws: (a)
the effect in school A, $\theta_1$; (b) the largest effect, $\max\{\theta_j\}$. The jaggedness of the
histograms is just an artifact caused by sampling variability from using only 200 random draws.


the results in the table differ from the pooled estimate in a direction toward the eight
independent answers: the 95% Bayesian intervals are each almost twice as wide as the one
common interval and suggest substantially greater probabilities of effects larger than 16
points, especially in school A, and greater probabilities of negative effects, especially in
school C. If greater precision were required in the posterior intervals, one could generate
more simulation draws; we use only 200 draws here to illustrate that a small simulation
gives adequate inference for many practical purposes.

The ordering of the effects in the eight schools as suggested by Table 5.3 is essentially the 
same as would be obtained by the eight separate estimates. However, there are differences 
in the details; for example, the Bayesian probability that the effect in school A is as large 
as 28 points is less than 10%, which is substantially less than the 50% probability based on 
the separate estimate for school A. 

As an illustration of the simulation-based posterior results, 200 simulations of school
A’s effect are shown in Figure 5.8a. Having simulated the parameter $\theta$, it is easy to ask
more complicated questions of this model. For example, what is the posterior distribution
of $\max\{\theta_j\}$, the effect of the most successful of the eight coaching programs? Figure 5.8b
displays a histogram of 200 values from this posterior distribution and shows that only 22
draws are larger than 28.4; thus, $\Pr(\max\{\theta_j\} > 28.4) \approx 22/200$ = 11%. Since Figure 5.8a
gives the marginal posterior distribution of the effect in school A, and Figure 5.8b gives the
marginal posterior distribution of the largest effect no matter which school it is in, the latter
figure has larger values. For another example, we can estimate $\Pr(\theta_1 > \theta_3 \mid y)$, the posterior
probability that the coaching program is more effective in school A than in school C, by the
proportion of simulated draws of $\theta$ for which $\theta_1 > \theta_3$; the result is 141/200 = 0.705.
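
The simulation-based summaries above can be imitated with a short script. The following is a
sketch under stated assumptions rather than the original computation: it takes the eight-schools
data from Table 5.2, places a uniform prior on $\tau$ over a grid truncated at an assumed upper
limit of 40, draws $\tau$ from the grid, then $\mu$ given $\tau$, then each $\theta_j$ given
$\mu$ and $\tau$. All function names and the grid settings are ours.

```python
import math
import random

# Estimated effects and standard errors for schools A-H, transcribed from Table 5.2.
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]

def conditional_mu(tau):
    """Mean and variance of mu given tau and y, under a flat prior on mu."""
    weights = [1.0 / (s ** 2 + tau ** 2) for s in sigma]
    v_mu = 1.0 / sum(weights)
    mu_hat = v_mu * sum(w * yj for w, yj in zip(weights, y))
    return mu_hat, v_mu

def log_post_tau(tau):
    """Unnormalized log marginal posterior of tau under a uniform prior on tau."""
    mu_hat, v_mu = conditional_mu(tau)
    total = 0.5 * math.log(v_mu)
    for yj, sj in zip(y, sigma):
        v = sj ** 2 + tau ** 2
        total -= 0.5 * math.log(v) + 0.5 * (yj - mu_hat) ** 2 / v
    return total

def simulate(n_draws, seed=0, tau_max=40.0, n_grid=2000):
    """Draw (theta_1, ..., theta_8) from the posterior via a grid on tau."""
    rng = random.Random(seed)
    step = tau_max / n_grid
    grid = [(i + 0.5) * step for i in range(n_grid)]  # midpoints, so tau > 0
    log_p = [log_post_tau(t) for t in grid]
    peak = max(log_p)
    weights = [math.exp(lp - peak) for lp in log_p]
    draws = []
    for tau in rng.choices(grid, weights=weights, k=n_draws):
        mu_hat, v_mu = conditional_mu(tau)
        mu = rng.gauss(mu_hat, math.sqrt(v_mu))
        theta = []
        for yj, sj in zip(y, sigma):
            v_j = 1.0 / (1.0 / sj ** 2 + 1.0 / tau ** 2)
            theta_hat = v_j * (yj / sj ** 2 + mu / tau ** 2)
            theta.append(rng.gauss(theta_hat, math.sqrt(v_j)))
        draws.append(theta)
    return draws

draws = simulate(200)
frac_max = sum(max(t) > 28.4 for t in draws) / len(draws)
frac_a_gt_c = sum(t[0] > t[2] for t in draws) / len(draws)
print(f"Pr(max theta_j > 28.4) is roughly {frac_max:.3f}")
print(f"Pr(theta_1 > theta_3) is roughly {frac_a_gt_c:.3f}")
```

With only 200 draws the two proportions vary noticeably from seed to seed, so they should be
expected to land near, not exactly on, the values reported in the text.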

To sum up, the Bayesian analysis of this example not only allows straightforward inferences
about many parameters that may be of interest, but the hierarchical model is flexible

        Raw data (deaths/total)   Log-odds,    sd,         Posterior quantiles of effect $\theta_j$,
Study                                                      normal approx. (on log-odds scale)
  j     Control      Treated       $y_j$    $\sigma_j$     2.5%    25%   median    75%   97.5%
  1     3/39         3/38          0.028      0.850       -0.57  -0.33   -0.24   -0.16    0.12
  2     14/116       7/114        -0.741      0.483       -0.64  -0.37   -0.28   -0.20   -0.00
  3     11/93        5/69         -0.541      0.565       -0.60  -0.35   -0.26   -0.18    0.05
  4     127/1520     102/1533     -0.246      0.138       -0.45  -0.31   -0.25   -0.19   -0.05
  5     27/365       28/355        0.069      0.281       -0.43  -0.28   -0.21   -0.11    0.15
  6     6/52         4/59         -0.584      0.676       -0.62  -0.35   -0.26   -0.18    0.05
  7     152/939      98/945       -0.512      0.139       -0.61  -0.43   -0.36   -0.28   -0.17
  8     48/471       60/632       -0.079      0.204       -0.43  -0.28   -0.21   -0.13    0.08
  9     37/282       25/278       -0.424      0.274       -0.58  -0.36   -0.28   -0.20   -0.02
 10     188/1921     138/1916     -0.335      0.117       -0.48  -0.35   -0.29   -0.23   -0.13
 11     52/583       64/873       -0.213      0.195       -0.48  -0.31   -0.24   -0.17    0.01
 12     47/266       45/263       -0.039      0.229       -0.43  -0.28   -0.21   -0.12    0.11
 13     16/293       9/291        -0.593      0.425       -0.63  -0.36   -0.28   -0.20    0.01
 14     45/883       57/858        0.282      0.205       -0.34  -0.22   -0.12    0.00    0.27
 15     31/147       25/154       -0.321      0.298       -0.56  -0.34   -0.26   -0.19    0.01
 16     38/213       33/207       -0.135      0.261       -0.48  -0.30   -0.23   -0.15    0.08
 17     12/122       28/251        0.141      0.364       -0.47  -0.29   -0.21   -0.12    0.17
 18     6/154        8/151         0.322      0.553       -0.51  -0.30   -0.23   -0.13    0.15
 19     3/134        6/174         0.444      0.717       -0.53  -0.31   -0.23   -0.14    0.15
 20     40/218       32/209       -0.218      0.260       -0.50  -0.32   -0.25   -0.17    0.04
 21     43/364       27/391       -0.591      0.257       -0.64  -0.40   -0.31   -0.23   -0.09
 22     39/674       22/680       -0.608      0.272       -0.65  -0.40   -0.31   -0.23   -0.07


Table 5.4 Results of 22 clinical trials of beta-blockers for reducing mortality after myocardial
infarction, with empirical log-odds and approximate sampling variances. Data from Yusuf et al. (1985).
Posterior quantiles of treatment effects are based on 5000 draws from a Bayesian hierarchical model
described here. Negative effects correspond to reduced probability of death under the treatment.


enough to adapt to the data, thereby providing posterior inferences that account for the 
partial pooling as well as the uncertainty in the hyperparameters. 


5.6 Hierarchical modeling applied to a meta-analysis 


Meta-analysis is an increasingly popular and important process of summarizing and inte- 
grating the findings of research studies in a particular area. As a method for combining 
information from several parallel data sources, meta-analysis is closely connected to hierar- 
chical modeling. In this section we consider a relatively simple application of hierarchical 
modeling to a meta-analysis in medicine. We consider another meta-analysis problem in 
the context of a decision problem in Section 9.2. 

The data in our medical example are displayed in the first three columns of Table 5.4, 
which summarize mortality after myocardial infarction in 22 clinical trials, each consisting of 
two groups of heart attack patients randomly allocated to receive or not receive beta-blockers 
(a family of drugs that affect the central nervous system and can relax the heart muscles). 
Mortality varies from 3% to 21% across the studies, most of which show a modest, though 
not ‘statistically significant,’ benefit from the use of beta-blockers. The aim of a meta- 
analysis is to provide a combined analysis of the studies that indicates the overall strength 
of the evidence for a beneficial effect of the treatment under study. Before proceeding to a 
formal meta-analysis, it is important to apply rigorous criteria in determining which studies 
are included. (This relates to concerns of ignorability in data collection for observational 
studies, as discussed in Chapter 8.)


Defining a parameter for each study


In the beta-blocker example, the meta-analysis involves data in the form of several 2 × 2
tables. If clinical trial $j$ (in the series to be considered for meta-analysis) involves the use
of $n_{0j}$ subjects in the control group and $n_{1j}$ in the treatment group, giving rise to $y_{0j}$ and
$y_{1j}$ deaths in control and treatment groups, respectively, then the usual sampling model
involves two independent binomial distributions with probabilities of death $p_{0j}$ and $p_{1j}$,
respectively. Estimands of interest include the difference in probabilities, $p_{1j} - p_{0j}$, the
probability or risk ratio, $p_{1j}/p_{0j}$, and the odds ratio,
$\rho_j = \frac{p_{1j}}{1 - p_{1j}} \Big/ \frac{p_{0j}}{1 - p_{0j}}$. For a number of
reasons, including interpretability in a range of study designs (including case-control studies
as well as clinical trials and cohort studies), and the fact that its posterior distribution is
close to normality even for relatively small sample sizes, we concentrate on inference for the
(natural) logarithm of the odds ratio, which we label $\theta_j = \log \rho_j$.


A normal approximation to the likelihood 


Relatively simple Bayesian meta-analysis is possible using the normal-theory results of the
previous sections if we summarize the results of each experiment $j$ with an approximate
normal likelihood for the parameter $\theta_j$. This is possible with a number of standard analytic
approaches that produce a point estimate and standard errors, which can be regarded as
approximating a normal mean and standard deviation. One approach is based on empirical
logits: for each study $j$, one can estimate $\theta_j$ by


$$ y_j = \log\left(\frac{y_{1j}}{n_{1j} - y_{1j}}\right) - \log\left(\frac{y_{0j}}{n_{0j} - y_{0j}}\right), \qquad (5.23) $$


with approximate sampling variance


$$ \sigma_j^2 = \frac{1}{y_{1j}} + \frac{1}{n_{1j} - y_{1j}} + \frac{1}{y_{0j}} + \frac{1}{n_{0j} - y_{0j}}. \qquad (5.24) $$


We use the notation $y_j$ and $\sigma_j^2$ to be consistent with our earlier expressions for the
hierarchical normal model. There are various refinements of these estimates that improve the
asymptotic normality of the sampling distributions involved (in particular, it is often
recommended to add a fraction such as 0.5 to each of the four counts in the 2 × 2 table), but
whenever study-specific sample sizes are moderately large, such details do not concern us.
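
The empirical logit and its approximate standard error are simple to compute from the four counts
of a 2 × 2 table. The sketch below is illustrative only: the function name `empirical_logit` and
its optional `add` argument (for the 0.5 continuity refinement mentioned above) are ours. The two
calls use the raw counts of studies 1 and 2 from Table 5.4.

```python
import math

def empirical_logit(y1, n1, y0, n0, add=0.0):
    """Log odds ratio estimate (5.23) and the square root of its variance (5.24).

    y1, n1: deaths and total in the treated group; y0, n0: same for control.
    add: optional fraction (such as 0.5) added to each of the four cell counts.
    """
    a, b = y1 + add, (n1 - y1) + add  # treated: deaths, survivors
    c, d = y0 + add, (n0 - y0) + add  # control: deaths, survivors
    estimate = math.log(a / b) - math.log(c / d)
    std_error = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    return estimate, std_error

# Study 1 of Table 5.4: control 3/39, treated 3/38.
est1, se1 = empirical_logit(y1=3, n1=38, y0=3, n0=39)
# Study 2 of Table 5.4: control 14/116, treated 7/114.
est2, se2 = empirical_logit(y1=7, n1=114, y0=14, n0=116)
print(f"study 1: y = {est1:.3f}, sigma = {se1:.3f}")
print(f"study 2: y = {est2:.3f}, sigma = {se2:.3f}")
```

Passing `add=0.5` gives the refined estimate; for small studies such as study 1 the shift is
visible, while for the larger trials it is negligible.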

The estimated log-odds ratios $y_j$ and their estimated standard errors $\sigma_j$ are displayed
as the fourth and fifth columns of Table 5.4. We use a hierarchical Bayesian analysis to
combine information from the 22 studies and gain improved estimates of each $\theta_j$, along with
estimates of the mean and variance of the effects over all studies.


Goals of inference in meta-analysis 


Discussions of meta-analysis are sometimes imprecise about the estimands of interest in the 
analysis, especially when the primary focus is on testing the null hypothesis of no effect in 
any of the studies to be combined. Our focus is on estimating meaningful parameters, and 
for this objective there appear to be three possibilities, accepting the overarching assumption 
that the studies are comparable in some broad sense. The first possibility is that we view 
the studies as identical replications of each other, in the sense that we regard the individuals in
all the studies as independent samples from a common population, with the same outcome 
measures and so on. A second possibility is that the studies are so different that the results 
of any one study provide no information about the results of any of the others. A third, more 
general, possibility is that we regard the studies as exchangeable but not necessarily either
identical or completely unrelated; in other words we allow differences from study to study,
but such that the differences are not expected a priori to have predictable effects favoring 
one study over another. As we have discussed in detail in this chapter, this third possibility 
represents a continuum between the two extremes, and it is this exchangeable model (with 
unknown hyperparameters characterizing the population distribution) that forms the basis 
of our Bayesian analysis. 

Exchangeability does not dictate the form of the joint distribution of the study effects. 
In what follows we adopt the convenient assumption of a normal distribution for the varying 
parameters; in practice it is important to check this assumption using some of the techniques 
discussed in Chapter 6. 

The first potential estimand of a meta-analysis, or a hierarchically structured problem 
in general, is the mean of the distribution of effect sizes, since this represents the overall 
‘average’ effect across all studies that could be regarded as exchangeable with the observed 
studies. Other possible estimands are the effect size in any of the observed studies and the 
effect size in another, comparable (exchangeable) unobserved study. 


What if exchangeability is inappropriate? 


When assuming exchangeability we assume there are no important covariates that might 
form the basis of a more complex model, and this assumption (perhaps misguidedly) is 
widely adopted in meta-analysis. What if other information (in addition to the data (n, y)) 
is available to distinguish among the J studies in a meta-analysis, so that an exchangeable 
model is inappropriate? In this situation, we can expand the framework of the model to be 
exchangeable in the observed data and covariates, for example using a hierarchical regression 
model, as in Chapter 15, so as to estimate how the treatment effect behaves as a function 
of the covariates. The real aim might in general be to estimate a response surface so that 
one could predict an effect based on known characteristics of a population and its exposure 
to risk. 


A hierarchical normal model 


A normal population distribution in conjunction with the approximate normal sampling 
distribution of the study-specific effect estimates allows an analysis of the same form as 
used for the SAT coaching example in the previous section. Let $y_j$ represent generically the
point estimate of the effect $\theta_j$ in the $j$th study, obtained from (5.23), where $j = 1, \ldots, J$.
The first stage of the hierarchical normal model assumes that


$$ y_j \mid \theta_j, \sigma_j \sim \mathrm{N}(\theta_j, \sigma_j^2), $$


where $\sigma_j$ represents the corresponding estimated standard error from (5.24), which is
assumed known without error. The simplification of known variances has little effect here
because, with the large sample sizes (more than 50 persons in each treatment group in
nearly all of the studies in the beta-blocker example), the binomial variances in each study
are precisely estimated. At the second stage of the hierarchy, we again use an exchangeable
normal prior distribution, with mean $\mu$ and standard deviation $\tau$, which are unknown
hyperparameters. Finally, a hyperprior distribution is required for $\mu$ and $\tau$. For this problem,
it is reasonable to assume a noninformative or locally uniform prior density for $\mu$, since
even with a small number of studies (say 5 or 10), the combined data become relatively
informative about the center of the population distribution of effect sizes. As with the
SAT coaching example, we also assume a locally uniform prior density for $\tau$, essentially for
convenience, although it is easy to modify the analysis to include prior information.

                                        Posterior quantiles
Estimand                             2.5%    25%   median    75%   97.5%
Mean, $\mu$                         -0.37  -0.29   -0.25   -0.20   -0.11
Standard deviation, $\tau$           0.02   0.08    0.13    0.18    0.31
Predicted effect, $\tilde{\theta}_j$  -0.58  -0.34   -0.25   -0.17    0.11


Table 5.5 Summary of posterior inference for the overall mean and standard deviation of study
effects, and for the predicted effect in a hypothetical future study, from the meta-analysis of the
beta-blocker trials in Table 5.4. All effects are on the log-odds scale.
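

For reference, the stages of the hierarchical normal model described in this subsection can be
collected in one display. This is a restatement in our own layout, with the locally uniform
hyperprior written as proportional to a constant; the last line is the implied marginal
distribution of each estimate after integrating out $\theta_j$.

$$
\begin{aligned}
y_j \mid \theta_j, \sigma_j &\sim \mathrm{N}(\theta_j, \sigma_j^2), \qquad j = 1, \ldots, J, \\
\theta_j \mid \mu, \tau &\sim \mathrm{N}(\mu, \tau^2), \\
p(\mu, \tau) &\propto 1, \\
y_j \mid \mu, \tau &\sim \mathrm{N}(\mu, \sigma_j^2 + \tau^2).
\end{aligned}
$$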


Results of the analysis and comparison to simpler methods 


The analysis of our meta-analysis model now follows exactly the same methodology as in
the previous sections. First, a plot (not shown here) similar to Figure 5.5 shows that the
marginal posterior density of $\tau$ peaks at a nonzero value, although values near zero are
clearly plausible, zero having a posterior density only about 25% lower than that at the
mode. Posterior quantiles for the effects $\theta_j$ for the 22 studies on the logit scale are displayed
as the last columns of Table 5.4.

Since the posterior distribution of $\tau$ is concentrated around values that are small relative
to the sampling standard deviations of the data (compare the posterior median of $\tau$, 0.13,
in Table 5.5 to the values of $\sigma_j$ in the fifth column of Table 5.4), considerable shrinkage
is evident in the Bayes estimates, especially for studies with low internal precision (for
example, studies 1, 6, and 18). The substantial degree of homogeneity between the studies
is further reflected in the large reductions in posterior variance obtained when going from
the study-specific estimates to the Bayesian ones, which borrow strength from each other.
Using an approximate approach fixing $\tau$ would yield standard deviations that would be too
small compared to the fully Bayesian ones.

Histograms (not shown) of the simulated posterior densities for each of the individual 
effects exhibit skewness away from the central value of the overall mean, whereas the distri- 
bution of the overall mean has greater symmetry. The imprecise studies, such as 2 and 18, 
exhibit longer-tailed posterior distributions than the more precise ones, such as 7 and 14. 

In meta-analysis, interest often focuses on the estimate of the overall mean effect, $\mu$.
Superimposing the graphs (not shown here) of the conditional posterior mean and standard
deviation of $\mu$ given $\tau$ on the posterior density of $\tau$ reveals a small range in the plausible
values of $E(\mu \mid \tau, y)$, from about -0.26 to just over -0.24, but $\mathrm{sd}(\mu \mid \tau, y)$ varies by a factor
of more than 2 across the plausible range of values of $\tau$. The latter feature indicates
the importance of averaging over $\tau$ in order to account adequately for uncertainty in its
estimation. In fact, the conditional posterior standard deviation, $\mathrm{sd}(\mu \mid \tau, y)$, has the value
0.060 at $\tau = 0.13$, whereas upon averaging over the posterior distribution for $\tau$ we find a
value of $\mathrm{sd}(\mu \mid y) = 0.071$.
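
One way to see why the averaged standard deviation exceeds the conditional one is the standard
variance decomposition, written here in the notation of this section:

$$
\mathrm{var}(\mu \mid y) = E\big[\mathrm{var}(\mu \mid \tau, y) \,\big|\, y\big]
  + \mathrm{var}\big[E(\mu \mid \tau, y) \,\big|\, y\big].
$$

Because $E(\mu \mid \tau, y)$ moves only between about -0.26 and -0.24, the second term is small.
The first term averages $\mathrm{sd}(\mu \mid \tau, y)^2$ over the posterior distribution of $\tau$,
and since that conditional standard deviation grows with $\tau$, the larger plausible values of
$\tau$ pull the average above the value obtained at the single point $\tau = 0.13$.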

Table 5.5 gives a summary of posterior inferences for the hyperparameters $\mu$ and $\tau$ and
the predicted effect, $\tilde{\theta}_j$, in a hypothetical future study. The approximate 95% highest
posterior density interval for $\mu$ is [-0.37, -0.11], or [0.69, 0.90] when converted to the odds
ratio scale (that is, exponentiated). In contrast, the 95% posterior interval that results
from complete pooling (that is, assuming $\tau = 0$) is considerably narrower, [0.70, 0.85]. In
the original published discussion of these data, it was remarked that the latter seems an
‘unusually narrow range of uncertainty.’ The hierarchical Bayesian analysis suggests that
this was due to the use of an inappropriate model that had the effect of claiming all the
studies were identical. In mathematical terms, complete pooling makes the assumption that
the parameter $\tau$ is exactly zero, whereas the data supply evidence that $\tau$ might be close
to zero, but might also plausibly be as high as 0.3. A related concern is that commonly
used analyses tend to place undue emphasis on inference for the overall mean effect.
Uncertainty about the probable treatment effect in a particular population where a study has
not been performed (or indeed in a previously studied population but with a slightly
modified treatment) might be more reasonably represented by inference for a new study effect,
exchangeable with those for which studies have been performed, rather than for the overall
mean. In this case, uncertainty is even greater, as exhibited in the ‘Predicted effect’ row of
Table 5.5; uncertainty for an individual patient includes yet another component of
variation. In particular, with the beta-blocker data, there is just over 10% posterior probability
that the true effect, $\tilde{\theta}_j$, in a new study would be positive (corresponding to the treatment
increasing the probability of death in that study).
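
The conditional summaries of $\mu$ and the complete-pooling interval discussed above can be
checked from the fourth and fifth columns of Table 5.4. The sketch below assumes a flat prior on
$\mu$ and uses the precision-weighted formulas for $\mu$ given $\tau$ from the hierarchical normal
model; the function name `conditional_mu` is ours. Because the tabulated inputs are rounded to
three decimals, the output should match the figures in the text only approximately.

```python
import math

# Log-odds ratios y_j and standard errors sigma_j, transcribed from Table 5.4.
y = [0.028, -0.741, -0.541, -0.246, 0.069, -0.584, -0.512, -0.079,
     -0.424, -0.335, -0.213, -0.039, -0.593, 0.282, -0.321, -0.135,
     0.141, 0.322, 0.444, -0.218, -0.591, -0.608]
sigma = [0.850, 0.483, 0.565, 0.138, 0.281, 0.676, 0.139, 0.204,
         0.274, 0.117, 0.195, 0.229, 0.425, 0.205, 0.298, 0.261,
         0.364, 0.553, 0.717, 0.260, 0.257, 0.272]

def conditional_mu(tau):
    """Return E(mu | tau, y) and sd(mu | tau, y) under a flat prior on mu."""
    weights = [1.0 / (s ** 2 + tau ** 2) for s in sigma]
    total = sum(weights)
    mean = sum(w * yj for w, yj in zip(weights, y)) / total
    return mean, math.sqrt(1.0 / total)

for tau in (0.0, 0.13, 0.3):
    mean, sd = conditional_mu(tau)
    print(f"tau = {tau:4.2f}: E(mu | tau, y) = {mean:6.3f}, sd(mu | tau, y) = {sd:5.3f}")

# Complete pooling (tau = 0): 95% interval for mu, exponentiated to the odds-ratio scale.
pooled_mean, pooled_sd = conditional_mu(0.0)
pooled_lo = math.exp(pooled_mean - 1.96 * pooled_sd)
pooled_hi = math.exp(pooled_mean + 1.96 * pooled_sd)
print(f"complete-pooling 95% interval for the odds ratio: [{pooled_lo:.2f}, {pooled_hi:.2f}]")
```

The loop makes the contrast in the text concrete: the conditional mean barely moves across
these values of `tau`, while the conditional standard deviation changes substantially.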


5.7 Weakly informative priors for variance parameters 


A key element in the analyses above is the prior distribution for the scale parameter, $\tau$.
We have used the uniform, but various other noninformative prior distributions have been
suggested in the Bayesian literature. It turns out that the choice of ‘noninformative’ prior
distribution can have a big effect on inferences, especially for problems where the number
of groups $J$ is small or the group-level variation $\tau$ is small.

We discuss the options here in the context of the normal model, but the principles apply 
to inferences for group-level variances more generally. 


Concepts relating to the choice of prior distribution 


Improper limit of a prior distribution. Improper prior densities can, but do not necessarily,
lead to proper posterior distributions. To avoid confusion it is useful to define improper
distributions as particular limits of proper distributions. For the group-level variance
parameter, two commonly considered improper densities are uniform$(0, A)$ on $\tau$, as $A \to \infty$,
and inverse-gamma$(\epsilon, \epsilon)$ on $\tau^2$, as $\epsilon \to 0$.

As we shall see, the uniform$(0, A)$ model yields a limiting proper posterior density as
$A \to \infty$, as long as the number of groups $J$ is at least 3. Thus, for a finite but sufficiently
large $A$, inferences are not sensitive to the choice of $A$.

In contrast, the inverse-gamma$(\epsilon, \epsilon)$ model does not have any proper limiting
posterior distribution. As a result, posterior inferences are sensitive to $\epsilon$; it cannot simply be
comfortably set to a low value such as 0.001.


Calibration. Posterior inferences can be evaluated using the concept of calibration of the
posterior mean, the Bayesian analogue to the classical notion of bias. For any parameter
$\theta$, if we label the posterior mean as $\hat{\theta} = E(\theta \mid y)$, we can define the miscalibration of the
posterior mean as $E(\theta \mid \hat{\theta}) - \hat{\theta}$. If the prior distribution is true (that is, if the data are
constructed by first drawing $\theta$ from $p(\theta)$, then drawing $y$ from $p(y \mid \theta)$), then the posterior
mean is automatically calibrated; that is, the miscalibration is 0 for all values of $\hat{\theta}$.

To restate: in classical bias analysis, we condition on the true $\theta$ and look at the
distribution of the data-based estimate, $\hat{\theta}$. In a Bayesian calibration analysis, we condition on
the data $y$ (and thus also on the estimate, $\hat{\theta}$) and look at the distribution of parameters $\theta$
that could have produced these data.

When considering improper models, the theory must be expanded, since it is impossible
for $\theta$ to be drawn from an unnormalized density. To evaluate calibration in this context,
it is necessary to posit a ‘true prior distribution’ from which $\theta$ is drawn along with the
‘inferential prior distribution’ that is used in the Bayesian inference.

For the hierarchical model for the 8 schools, we can consider the improper uniform
density on $\tau$ as a limit of uniform prior densities on the range $(0, A)$, with $A \to \infty$. For
any finite value of $A$, we can then see that the improper uniform density leads to inferences
with a positive miscalibration, that is, overestimates (on average) of $\tau$.

We demonstrate this miscalibration in two steps. First, suppose that both the true
and inferential prior distributions for $\tau$ are uniform on $(0, A)$. Then the miscalibration is
trivially zero. Now keep the true prior distribution at U$(0, A)$ and let the inferential prior
distribution go to U$(0, \infty)$. This will necessarily increase $\hat{\theta}$ for any data $y$ (since we are now
averaging over values of $\theta$ in the range $[A, \infty)$) without changing the true $\theta$, thus causing
the average value of the miscalibration to become positive.


Classes of noninformative and weakly informative prior distributions for hierarchical 
variance parameters 


General considerations. We view any noninformative or weakly informative prior distribu- 
tion as inherently provisional—after the model has been fit, one should look at the posterior 
distribution and see if it makes sense. If the posterior distribution does not make sense, 
this implies that additional prior knowledge is available that has not been included in the 
model, and that contradicts the assumptions of the prior distribution that has been used. 
It is then appropriate to go back and alter the prior distribution to be more consistent with 
this external knowledge. 


Uniform prior distributions. We first consider uniform priors while recognizing that we
must be explicit about the scale on which the distribution is defined. Various choices have
been proposed for modeling variance parameters. A uniform prior distribution on $\log\tau$
would seem natural (working with the logarithm of a parameter that must be positive),
but it results in an improper posterior distribution. An alternative would be to define the
prior distribution on a compact set (e.g., in the range $[-A, A]$ for some large value of $A$),
but then the posterior distribution would depend strongly on the lower bound $-A$ of the
prior support.

The problem arises because the marginal likelihood, $p(y \mid \tau)$, after integrating over $\theta$ and
$\mu$ in (5.16), approaches a finite nonzero value as $\tau \to 0$. Thus, if the prior density for $\log\tau$
is uniform, the posterior will have infinite mass integrating to the limit $\log\tau \to -\infty$. To put
it another way, in a hierarchical model the data can never rule out a group-level variance
of zero, and so the prior distribution cannot put an infinite mass in this area.
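
The flatness of the marginal likelihood near zero is easy to inspect numerically. The following
sketch assumes the eight-schools data of Table 5.2 and a flat prior on $\mu$, and evaluates the
log marginal likelihood of $\tau$ up to an additive constant; the function name is ours. The
second loop shows the other half of the argument: a density proportional to $1/\tau$ places
unbounded mass near zero.

```python
import math

# Estimated effects and standard errors for schools A-H, transcribed from Table 5.2.
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]

def log_marginal_likelihood(tau):
    """Unnormalized log p(y | tau), with theta and a flat-prior mu integrated out."""
    variances = [s ** 2 + tau ** 2 for s in sigma]
    v_mu = 1.0 / sum(1.0 / v for v in variances)
    mu_hat = v_mu * sum(yj / v for yj, v in zip(y, variances))
    value = 0.5 * math.log(v_mu)
    for yj, v in zip(y, variances):
        value -= 0.5 * math.log(v) + 0.5 * (yj - mu_hat) ** 2 / v
    return value

# The log marginal likelihood settles to a finite value as tau shrinks to zero.
for tau in (0.0, 1e-6, 1e-3, 1.0):
    print(f"tau = {tau:g}: log p(y | tau) = {log_marginal_likelihood(tau):.6f} + const")

# A prior proportional to 1/tau puts mass -log(a) on (a, 1), unbounded as a -> 0.
for a in (1e-2, 1e-4, 1e-8):
    print(f"integral of 1/tau over ({a:g}, 1) = {-math.log(a):.1f}")
```

Multiplying a likelihood that stays bounded away from zero by a prior with unbounded mass near
zero is what makes the resulting posterior improper.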

Another option is a uniform prior distribution on $\tau$ itself, which has a finite integral
near $\tau = 0$ and thus avoids the above problem. We have generally used this noninformative
density in our applied work (as illustrated in Section 5.5), but it has a slightly disagreeable
miscalibration toward positive values, with its infinite prior mass in the range $\tau \to \infty$.
With $J$ = 1 or 2 groups, this actually results in an improper posterior density, essentially
concluding $\tau = \infty$ and doing no pooling. In a sense this is reasonable behavior, since it
would seem difficult from the data alone to decide how much, if any, pooling should be
done with data from only one or two groups. However, from a Bayesian perspective it is
awkward for the decision to be made ahead of time, as it were, with the data having no say
in the matter. In addition, for small $J$, such as 4 or 5, we worry that the heavy right tail of
the posterior distribution would lead to overestimates of $\tau$ and thus result in pooling that
is less than optimal for estimating the individual $\theta_j$’s.

We can interpret these improper uniform prior densities as limits of weakly informative
conditionally conjugate priors. The uniform prior distribution on $\log\tau$ is equivalent to
$p(\tau) \propto \tau^{-1}$ or $p(\tau^2) \propto \tau^{-2}$, which has the form of an inverse-$\chi^2$ density with 0 degrees of
freedom and can be taken as a limit of proper inverse-gamma priors.

The uniform density on $\tau$ is equivalent to $p(\tau^2) \propto \tau^{-1}$, an inverse-$\chi^2$ density with $-1$
degrees of freedom. This density cannot easily be seen as a limit of proper inverse-$\chi^2$
densities (since these must have positive degrees of freedom), but it can be interpreted as a
limit of the half-$t$ family on $\tau$, where the scale approaches $\infty$ (for any value of $\nu$).
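
The equivalences stated in the two paragraphs above follow from the change-of-variables rule for
densities. As a brief worked derivation in our own layout:

$$
p(\log\tau) \propto 1
\;\Longrightarrow\;
p(\tau) = p(\log\tau)\left|\frac{d\log\tau}{d\tau}\right| \propto \tau^{-1},
\qquad
p(\tau^2) = p(\tau)\left|\frac{d\tau}{d\tau^2}\right| \propto \tau^{-1}\cdot\frac{1}{2\tau} \propto \tau^{-2},
$$

$$
p(\tau) \propto 1
\;\Longrightarrow\;
p(\tau^2) = p(\tau)\left|\frac{d\tau}{d\tau^2}\right| \propto \frac{1}{2\tau} \propto \tau^{-1}.
$$

Comparing the power of $\tau^2$ with the kernel $(\tau^2)^{-(\nu/2+1)}$ of the inverse-$\chi^2$
density gives $\nu = 0$ in the first case, since $\tau^{-2} = (\tau^2)^{-1}$, and $\nu = -1$ in the
second, since $\tau^{-1} = (\tau^2)^{-1/2}$.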

Another noninformative prior distribution sometimes proposed in the Bayesian literature
is uniform on $\tau^2$. We do not recommend this, as it seems to have the miscalibration toward
higher values as described above, but more so, and also requires $J \geq 4$ groups for a proper
posterior distribution.


Inverse-gamma$(\epsilon, \epsilon)$ prior distributions. The parameter $\tau$ in model (5.21) does not have
any simple family of conjugate prior distributions because its marginal likelihood depends
in a complex way on the data from all $J$ groups. However, the inverse-gamma family
is conditionally conjugate given the other parameters in the model: that is, if $\tau^2$ has an
inverse-gamma prior distribution, then the conditional posterior distribution $p(\tau^2 \mid \theta, \mu, y)$
is also inverse-gamma. The inverse-gamma$(\alpha, \beta)$ model for $\tau^2$ can also be expressed as an
inverse-$\chi^2$ distribution with scale $s^2 = \beta/\alpha$ and degrees of freedom $\nu = 2\alpha$. The
inverse-$\chi^2$ parameterization can be helpful in understanding the information underlying various
choices of proper prior distributions.
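
To make the correspondence between the two parameterizations explicit, the densities can be
compared directly. The display below is a sketch using the standard forms of the two families:

$$
\text{inverse-gamma}(\alpha, \beta):\quad
p(\tau^2) \propto (\tau^2)^{-(\alpha+1)} \exp\!\left(-\frac{\beta}{\tau^2}\right),
\qquad
\text{inverse-}\chi^2(\nu, s^2):\quad
p(\tau^2) \propto (\tau^2)^{-(\nu/2+1)} \exp\!\left(-\frac{\nu s^2}{2\tau^2}\right).
$$

Matching exponents gives $\alpha = \nu/2$ and $\beta = \nu s^2/2$, that is, $\nu = 2\alpha$ and
$s^2 = \beta/\alpha$. Conditional conjugacy can be seen the same way: multiplying the
inverse-gamma prior by the population density $\prod_{j=1}^{J} \mathrm{N}(\theta_j \mid \mu, \tau^2)$ gives

$$
p(\tau^2 \mid \theta, \mu, y) \propto
(\tau^2)^{-(\alpha + J/2 + 1)}
\exp\!\left(-\frac{\beta + \tfrac{1}{2}\sum_{j=1}^{J}(\theta_j - \mu)^2}{\tau^2}\right),
$$

which is again an inverse-gamma density, with parameters $\alpha + J/2$ and
$\beta + \tfrac{1}{2}\sum_j (\theta_j - \mu)^2$.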

The inverse-gamma$(\epsilon, \epsilon)$ prior distribution is an attempt at noninformativeness within
the conditionally conjugate family, with $\epsilon$ set to a low value such as 1 or 0.01 or 0.001.
A difficulty of this prior distribution is that in the limit of $\epsilon \to 0$ it yields an improper
posterior density, and thus $\epsilon$ must be set to a reasonable value. Unfortunately, for datasets
in which low values of $\tau$ are possible, inferences become very sensitive to $\epsilon$ in this model,
and the prior distribution hardly looks noninformative, as we illustrate in Figure 5.9.


Half-Cauchy prior distributions. We shall also consider the $t$ family of distributions
(actually, the half-$t$, since the scale parameter $\tau$ is constrained to be positive) as an alternative
class that includes normal and Cauchy as edge cases. We first considered the $t$ model for
this problem because it can be expressed as a conditionally conjugate prior distribution for
$\tau$ using a reparameterization.

For our purposes here, however, it is enough to recognize that the half-Cauchy can be a
convenient weakly informative family; the distribution has a broad peak at zero and a single
scale parameter, which we shall label $A$ to indicate that it could be set to some large value.
In the limit $A \to \infty$ this becomes a uniform prior density on $\tau$. Large but finite values of
$A$ represent prior distributions which we consider weakly informative because, even in the
tail, they have a gentle slope (unlike, for example, a half-normal distribution) and can let
the data dominate if the likelihood is strong in that region. We shall consider half-Cauchy
models for variance parameters which are estimated from a small number of groups (so that
inferences are sensitive to the choice of weakly informative prior distribution).
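
The two properties claimed for the half-Cauchy (a gentle tail, and flatness as $A$ grows) can be
seen from its density. The sketch below uses the standard half-Cauchy and half-normal density
formulas; the function names and the particular values of the scale and of $\tau$ (25, 100, 20)
are arbitrary choices of ours, made only for illustration.

```python
import math

def half_cauchy_pdf(tau, scale):
    """Half-Cauchy density on tau >= 0 with scale parameter `scale` (A in the text)."""
    return 2.0 / (math.pi * scale * (1.0 + (tau / scale) ** 2))

def half_normal_pdf(tau, scale):
    """Half-normal density on tau >= 0 with scale parameter `scale`."""
    return math.sqrt(2.0 / math.pi) / scale * math.exp(-0.5 * (tau / scale) ** 2)

# Tail behavior: density at four scale units relative to the density at zero.
scale = 25.0
for name, pdf in (("half-Cauchy", half_cauchy_pdf), ("half-normal", half_normal_pdf)):
    ratio = pdf(4 * scale, scale) / pdf(0.0, scale)
    print(f"{name}: p(tau = {4 * scale:g}) / p(tau = 0) = {ratio:.5f}")

# Flatness: as A grows, the half-Cauchy density at tau = 20 approaches its value at zero.
for a in (5.0, 25.0, 100.0, 1000.0):
    ratio = half_cauchy_pdf(20.0, a) / half_cauchy_pdf(0.0, a)
    print(f"A = {a:g}: p(tau = 20) / p(tau = 0) = {ratio:.4f}")
```

The half-Cauchy ratio at four scale units is 1/17, whereas the half-normal ratio is
$e^{-8}$, which is why a likelihood concentrated in the tail can still dominate the former.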


Application to the 8-schools example 


We demonstrate the properties of some proposed noninformative prior densities on the
eight-schools example of Section 5.5. Here, the parameters $\theta_1, \ldots, \theta_8$ represent the relative
effects of coaching programs in eight different schools, and $\tau$ represents the between-school
standard deviation of these effects. The effects are measured as points on the test, which
was scored from 200 to 800 with an average of about 500; thus the largest possible range of
effects could be about 300 points, with a realistic upper limit on $\tau$ of 100, say.


Noninformative prior distributions for the 8-schools problem. Figure 5.9 displays the pos- 
terior distributions for the 8-schools model resulting from three different choices of prior 
distributions that are intended to be noninformative. 

The leftmost histogram shows posterior inference for $\tau$ for the model with uniform prior
density. The data show support for a range of values below $\tau = 20$, with a slight tail after
that, reflecting the possibility of larger values, which are difficult to rule out given that the
number of groups $J$ is only 8 (that is, not much more than the $J = 3$ required to ensure a
proper posterior density with finite mass in the right tail).

Figure 5.9 Histograms of posterior simulations of the between-school standard deviation, $\tau$,
from models with three different prior distributions: (a) uniform prior distribution on $\tau$, (b)
inverse-gamma(1, 1) prior distribution on $\tau^2$, (c) inverse-gamma(0.001, 0.001) prior distribution
on $\tau^2$. Overlain on each is the corresponding prior density function for $\tau$. (For models (b) and
(c), the density for $\tau$ is calculated using the gamma density function multiplied by the Jacobian of
the $1/\tau^2$ transformation.) In models (b) and (c), posterior inferences are strongly constrained by
the prior distribution.


In contrast, the middle histogram in Figure 5.9 shows the result with an inverse-gamma(1, 1)
prior distribution for $\tau^2$. This new prior distribution leads to changed inferences.
In particular, the posterior mean and median of $\tau$ are lower, and shrinkage of the
$\theta_j$’s is greater than in the previously fitted model with a uniform prior distribution on $\tau$. To
understand this, it helps to graph the prior distribution in the range for which the posterior
distribution is substantial. The graph shows that the prior distribution is concentrated in
the range [0.5, 5], a narrow zone in which the likelihood is close to flat compared to this prior
(as we can see because the distribution of the posterior simulations of $\tau$ closely matches the
prior distribution, $p(\tau)$). By comparison, in the left graph, the uniform prior distribution
on $\tau$ seems closer to ‘noninformative’ for this problem, in the sense that it does not appear
to be constraining the posterior inference.
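
The claim that the inverse-gamma(1, 1) prior concentrates $\tau$ in [0.5, 5] can be checked by
transforming the prior on $\tau^2$ to the scale of $\tau$. The sketch below works with the
inverse-gamma density on $\tau^2$ and the Jacobian $2\tau$, which is an equivalent route to the
gamma-density calculation mentioned in the caption of Figure 5.9; the function names and the
midpoint-rule integration are ours.

```python
import math

def inv_gamma_pdf(x, alpha, beta):
    """Inverse-gamma(alpha, beta) density at x > 0."""
    log_pdf = (alpha * math.log(beta) - math.lgamma(alpha)
               - (alpha + 1.0) * math.log(x) - beta / x)
    return math.exp(log_pdf)

def prior_on_tau(tau, alpha, beta):
    """Implied density of tau when tau^2 ~ inverse-gamma(alpha, beta)."""
    return inv_gamma_pdf(tau ** 2, alpha, beta) * 2.0 * tau  # Jacobian d(tau^2)/d(tau)

def mass(lo, hi, alpha, beta, n=20000):
    """Midpoint-rule integral of the implied prior on tau over [lo, hi]."""
    step = (hi - lo) / n
    return step * sum(prior_on_tau(lo + (i + 0.5) * step, alpha, beta) for i in range(n))

# Inverse-gamma(1, 1): the distribution function of tau^2 is exp(-1/x), giving a closed form.
mass_1 = mass(0.5, 5.0, 1.0, 1.0)
exact_1 = math.exp(-1.0 / 25.0) - math.exp(-1.0 / 0.25)
print(f"inverse-gamma(1, 1): prior mass of tau in [0.5, 5] = {mass_1:.3f} (closed form {exact_1:.3f})")

# Inverse-gamma(0.001, 0.001): very little prior mass falls in the same range.
mass_small = mass(0.5, 5.0, 0.001, 0.001)
print(f"inverse-gamma(0.001, 0.001): prior mass of tau in [0.5, 5] = {mass_small:.4f}")
```

The first result confirms that nearly all of the inverse-gamma(1, 1) prior mass lies in the
narrow range named in the text, which is where that prior overrides the flat likelihood.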

Finally, the rightmost histogram in Figure 5.9 shows the corresponding result with an 
inverse-gamma(0.001, 0.001) prior distribution for τ². This prior distribution is even more 
sharply peaked near zero and further distorts posterior inferences, with the problem arising 
because the marginal likelihood for τ remains high near zero. 

In this example, we do not consider a uniform prior density on log τ, which would yield 
an improper posterior density with a spike at τ = 0, like the rightmost graph in Figure 5.9 
but more so. We also do not consider a uniform prior density on τ², which would yield a 
posterior similar to the leftmost graph in Figure 5.9, but with a slightly higher right tail. 

This example is a gratifying case in which the simplest approach, the uniform prior 
density on τ, seems to perform well. As detailed in Appendix C, this model is also 
straightforward to program directly in R or Stan. 

The appearance of the histograms and density plots in Figure 5.9 is crucially affected by 
the choice to plot them on the scale of τ. If instead they were plotted on the scale of log τ, 
the inverse-gamma(0.001, 0.001) prior density would appear to be the flattest. However, the 
inverse-gamma(ε, ε) prior is not at all ‘noninformative’ for this problem since the resulting 
posterior distribution remains highly sensitive to the choice of ε. The hierarchical model 
likelihood does not constrain log τ in the limit log τ → −∞, and so a prior distribution that 
is noninformative on the log scale will not work. 
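
To see why the inverse-gamma(0.001, 0.001) prior looks flattest on the log scale, one can write out the density it implies for log τ. The sketch below is illustrative only: the function names are ours, and the inverse-gamma density is again assumed to be in shape/scale form.

```python
import math

def log_scale_density(log_tau, a, b):
    """Density of log(tau) implied by tau^2 ~ inverse-gamma(a, b).

    p(tau^2) = b^a / Gamma(a) * (tau^2)^-(a+1) * exp(-b / tau^2), and
    d(tau^2)/d(log tau) = 2 * tau^2, so
    p(log tau) = 2 * b^a / Gamma(a) * (tau^2)^-a * exp(-b / tau^2).
    """
    tau_sq = math.exp(2.0 * log_tau)
    log_p = (math.log(2.0) + a * math.log(b) - math.lgamma(a)
             - a * math.log(tau_sq) - b / tau_sq)
    return math.exp(log_p)

def flatness(a, b, lo=0.0, hi=2.0):
    """Ratio of the density at log(tau) = hi to the density at log(tau) = lo."""
    return log_scale_density(hi, a, b) / log_scale_density(lo, a, b)

ratio_vague = flatness(0.001, 0.001)  # close to 1: nearly flat in log(tau)
ratio_unit = flatness(1.0, 1.0)       # far from 1: clearly not flat
```

Between log τ = 0 and log τ = 2 the inverse-gamma(0.001, 0.001) density changes by less than 1%, while the inverse-gamma(1,1) density falls by a factor of about 20. Flatness on this scale is exactly what the text warns against, because the likelihood does not rule out very small values of τ.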


Weakly informative prior distribution for the 8-schools problem 


The uniform prior distribution seems fine for the 8-school analysis, but problems arise if the 
number of groups J is much smaller, in which case the data supply little information about 
the group-level variance, and a noninformative prior distribution can lead to a posterior 
distribution that is improper or is proper but unrealistically broad. We demonstrate by 
reanalyzing the 8-schools example using just the data from the first three of the schools. 

[Figure 5.10: two histograms of posterior simulations of τ for the 3-schools data. Panel titles: 
(a) posterior on τ given uniform prior on τ; (b) posterior on τ given half-Cauchy (25) prior on τ.] 

Figure 5.10 Histograms of posterior simulations of the between-school standard deviation, τ, from 
models for the 3-schools data with two different prior distributions on τ: (a) uniform (0, ∞), (b) 
half-Cauchy with scale 25, set as a weakly informative prior distribution given that τ was expected 
to be well below 100. The histograms are not on the same scales. Overlain on each histogram is 
the corresponding prior density function. With only J = 3 groups, the noninformative uniform 
prior distribution is too weak, and the proper Cauchy distribution works better, without appearing 
to distort inferences in the area of high likelihood. 


Figure 5.10 displays the inferences for τ based on two different priors. First we continue 
with the default uniform distribution that worked well with J = 8 (as seen in Figure 5.9). 
Unfortunately, as the left histogram of Figure 5.10 shows, the resulting posterior distribution 
for the 3-schools dataset has an extremely long right tail, containing values of τ that are 
too high to be reasonable. This heavy tail is expected since J is so low (if J were any lower, 
the right tail would have an infinite integral), and using this as a posterior distribution will 
have the effect of underpooling the estimates of the school effects θ_j. 
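
The remark that the right tail would have an infinite integral if J were any lower can be checked numerically. The sketch below evaluates the unnormalized marginal posterior density of τ for the hierarchical normal model with μ integrated out under a uniform prior on (μ, τ), as we understand the formula from Section 5.4. The data values are an assumption: they are taken to be the estimates and standard errors of the first three schools in the educational testing example and should be verified against the data table. The function names are ours.

```python
import math

# Assumed estimates and standard errors for the first three schools
# (verify against the data table for the educational testing example).
Y = [28.0, 8.0, -3.0]
SIGMA = [15.0, 10.0, 16.0]

def log_marginal_posterior_tau(tau, y=Y, sigma=SIGMA):
    """Unnormalized log p(tau | y) under a uniform prior on (mu, tau)."""
    total_var = [s * s + tau * tau for s in sigma]
    weights = [1.0 / v for v in total_var]
    v_mu = 1.0 / sum(weights)                       # posterior variance of mu given tau
    mu_hat = v_mu * sum(w * yj for w, yj in zip(weights, y))
    log_p = 0.5 * math.log(v_mu)
    for yj, v in zip(y, total_var):
        log_p += -0.5 * math.log(v) - 0.5 * (yj - mu_hat) ** 2 / v
    return log_p

def tail_ratio(tau, y=Y, sigma=SIGMA):
    """p(2*tau | y) / p(tau | y); tends to 2^-(J-1) if the tail decays like tau^-(J-1)."""
    return math.exp(log_marginal_posterior_tau(2.0 * tau, y, sigma)
                    - log_marginal_posterior_tau(tau, y, sigma))

ratio_three = tail_ratio(1.0e4)                   # J = 3 schools
ratio_two = tail_ratio(1.0e4, Y[:2], SIGMA[:2])   # J = 2 schools
```

With J = 3 the ratio is close to 1/4, so the density decays like τ^(−2), which is integrable but only barely. With J = 2 the ratio is close to 1/2, so the density decays like τ^(−1) and its integral diverges.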

The right histogram of Figure 5.10 shows the posterior inference for τ resulting from 
a half-Cauchy prior distribution with scale parameter A = 25 (a value chosen to be a bit 
higher than we expect for the standard deviation of the underlying θ_j’s in the context of 
this educational testing example, so that the model will constrain τ only weakly). As the 
line on the graph shows, this prior distribution is high over the plausible range of τ < 50, 
falling off gradually beyond this point. This prior distribution appears to perform well in 
this example, reflecting the marginal likelihood for τ at its low end but removing much of 
the unrealistic upper tail. 
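
For reference, the half-Cauchy density with scale A is p(τ) = 2 / (πA(1 + (τ/A)²)) for τ ≥ 0. The short sketch below (function names are ours) evaluates it for A = 25 to show how gradually the prior falls off.

```python
import math

def half_cauchy_pdf(tau, scale):
    """Half-Cauchy density on tau >= 0 with the given scale parameter."""
    if tau < 0:
        return 0.0
    return 2.0 / (math.pi * scale * (1.0 + (tau / scale) ** 2))

def half_cauchy_cdf(tau, scale):
    """Prior probability that the parameter lies below tau."""
    if tau <= 0:
        return 0.0
    return 2.0 / math.pi * math.atan(tau / scale)

A = 25.0
relative_height_at_50 = half_cauchy_pdf(50.0, A) / half_cauchy_pdf(0.0, A)
mass_below_50 = half_cauchy_cdf(50.0, A)
mass_below_100 = half_cauchy_cdf(100.0, A)
```

At τ = 50 the density is one fifth of its value at zero, about 70% of the prior mass lies below 50, and about 84% lies below 100. The prior therefore leaves substantial room for large values of τ while still having a finite integral, which is what removes the extreme upper tail of the posterior.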

This half-Cauchy prior distribution would also perform well in the 8-schools problem; 
however it was unnecessary because the default uniform prior gave reasonable results. With 
only 3 schools, we went to the trouble of using a weakly informative prior, a distribution 
that was not intended to represent our actual prior state of knowledge about τ but rather 
to constrain the posterior distribution, to an extent allowed by the data. 


5.8 Bibliographic note 


The early non-Bayesian work on shrinkage estimation of Stein (1955) and James and Stein 
(1960) was influential in the development of hierarchical normal models. Efron and Morris 
(1971, 1972) present subsequent theoretical work on the topic. Robbins (1955, 1964) con- 
structs and justifies hierarchical methods from a decision-theoretic perspective. De Finetti’s 
theorem is described by de Finetti (1974); Bernardo and Smith (1994) discuss its role in 
Bayesian modeling. An early thorough development of the idea of Bayesian hierarchical 
modeling is given by Good (1965). 

Mosteller and Wallace (1964) analyzed a hierarchical Bayesian model using the negative 
binomial distribution for counts of words in a study of authorship. Restricted to the limited 
computing power at the time, they used various approximations and point estimates for 
hyperparameters. 

Other historically influential papers on ‘empirical Bayes’ (or, in our terminology, hierarchical 
Bayes) include Hartley and Rao (1967), Laird and Ware (1982) on longitudinal modeling, 
and Clayton and Kaldor (1987) and Breslow (1990) on epidemiology and biostatistics. 
Morris (1983) and Deely and Lindley (1981) explored the relation between Bayesian and 
non-Bayesian ideas for these models. 

The problem of estimating several normal means using an exchangeable hierarchical 
model was treated in a fully Bayesian framework by Hill (1965), Tiao and Tan (1965, 
1966), and Lindley (1971b). Box and Tiao (1973) present hierarchical normal models using 
slightly different notation from ours. They compare Bayesian and non-Bayesian methods 
and discuss the analysis of variance table in some detail. More references on hierarchical 
normal models appear in the bibliographic note at the end of Chapter 15. 

The past few decades have seen the publication of applied Bayesian analyses using hierar- 
chical models in a wide variety of application areas. For example, an important application 
of hierarchical models is ‘small-area estimation,’ in which estimates of population charac- 
teristics for local areas are improved by combining the data from each area with information 
from neighboring areas (with important early work from Fay and Herriot, 1979, Dempster 
and Raghunathan, 1987, and Mollie and Richardson, 1991). Other applications that have 
motivated methodological development include measurement error problems in epidemiol- 
ogy (for example, Richardson and Gilks, 1993), multiple comparisons in toxicology (Meng 
and Dempster, 1987), and education research (Bock, 1989). We provide references to a 
number of other applications in later chapters dealing with specific model types. 

Hierarchical models can be viewed as a subclass of ‘graphical models,’ and this connec- 
tion has been elegantly exploited for Bayesian inference in the development of the computer 
package Bugs, using techniques that will be explained in Chapter 11 (see also Appendix C); 
see Thomas, Spiegelhalter, and Gilks (1992), and Spiegelhalter et al. (1994, 2003). Related 
discussion and theoretical work appears in Lauritzen and Spiegelhalter (1988), Pearl (1988), 
Wermuth and Lauritzen (1990), and Normand and Tritchler (1992). 

The rat tumor data were analyzed hierarchically by Tarone (1982) and Dempster, Sel- 
wyn, and Weeks (1983); our approach is close in spirit to the latter paper’s. Leonard (1972) 
and Novick, Lewis, and Jackson (1973) are early examples of hierarchical Bayesian analysis 
of binomial data. 

Much of the material in Sections 5.4 and 5.5, along with much of Section 6.5, originally 
appeared in Rubin (1981a), which is an early example of an applied Bayesian analysis using 
simulation techniques. For later work on the effects of coaching on Scholastic Aptitude Test 
scores, see Hansen (2004). 

The weakly-informative half-Cauchy prior distribution for the 3-schools problem in Sec- 
tion 5.7 comes from Gelman (2006a). Polson and Scott (2012) provide a theoretical justifi- 
cation for this model. 

The material of Section 5.6 is adapted from Carlin (1992), which contains several key 
references on meta-analysis; the original data for the example are from Yusuf et al. (1985); 
a similar Bayesian analysis of these data under a slightly different model appears as an 
example in Spiegelhalter et al. (1994, 2003). Thall et al. (2003) discuss hierarchical models 
for medical treatments that vary across subtypes of a disease. More general treatments 
of meta-analysis from a Bayesian perspective are provided by DuMouchel (1990), Rubin 
(1989), Skene and Wakefield (1990), and Smith, Spiegelhalter, and Thomas (1995). An ex- 
ample of a Bayesian meta-analysis appears in Dominici et al. (1999). DuMouchel and Harris 
(1983) present what is essentially a meta-analysis with covariates on the studies; this article 
is accompanied by some interesting discussion by prominent Bayesian and non-Bayesian 
statisticians. Higgins and Whitehead (1996) discuss how to construct a prior distribution 
for the group-level variance in a meta-analysis by considering it as an example from a larger 
population of meta-analyses. Lau, Ioannidis, and Schmid (1997) provide practical advice 
on meta-analysis. 


5.9 Exercises 


1. Exchangeability with known model parameters: For each of the following three examples, 
answer: (i) Are observations y_1 and y_2 exchangeable? (ii) Are observations y_1 and y_2 
independent? (iii) Can we act as if the two observations are independent? 


(a) A box has one black ball and one white ball. We pick a ball y_1 at random, put it back, 
and pick another ball y_2 at random. 


(b) A box has one black ball and one white ball. We pick a ball y_1 at random, we do not 
put it back, then we pick ball y_2. 


(c) A box has a million black balls and a million white balls. We pick a ball y_1 at random, 
we do not put it back, then we pick ball y_2 at random. 


2. Exchangeability with unknown model parameters: For each of the following three examples, 
answer: (i) Are observations y_1 and y_2 exchangeable? (ii) Are observations y_1 and 
y_2 independent? (iii) Can we act as if the two observations are independent? 


(a) A box has n black and white balls but we do not know how many of each color. We 
pick a ball y_1 at random, put it back, and pick another ball y_2 at random. 


(b) A box has n black and white balls but we do not know how many of each color. We 
pick a ball y_1 at random, we do not put it back, then we pick ball y_2 at random. 


(c) Same as (b) but we know that there are many balls of each color in the box. 


3. Hierarchical models and multiple comparisons: 


(a) Reproduce the computations in Section 5.5 for the educational testing example. Use 
the posterior simulations to estimate (i) for each school j, the probability that its 
coaching program is the best of the eight; and (ii) for each pair of schools, j and k, 
the probability that the coaching program in school j is better than that in school k. 


(b) Repeat (a), but for the simpler model with τ set to ∞ (that is, separate estimation 
for the eight schools). In this case, the probabilities (ii) can be computed analytically. 


(c) Discuss how the answers in (a) and (b) differ. 


(d) In the model with τ set to 0, the probabilities (i) and (ii) have degenerate values; what 
are they? 


4. Exchangeable prior distributions: suppose it is known a priori that the 2J parameters 
θ_1, ..., θ_{2J} are clustered into two groups, with exactly half being drawn from a N(1, 1) 
distribution, and the other half being drawn from a N(−1, 1) distribution, but we have 
not observed which parameters come from which distribution. 


(a) Are θ_1, ..., θ_{2J} exchangeable under this prior distribution? 


(b) Show that this distribution cannot be written as a mixture of independent and 
identically distributed components. 


(c) Why can we not simply take the limit as J → ∞ and get a counterexample to de 
Finetti’s theorem? 


See Exercise 8.10 for a related problem. 


5. Mixtures of independent distributions: suppose the distribution of θ = (θ_1, ..., θ_J) can 
be written as a mixture of independent and identically distributed components: 


p(θ) = ∫ ∏_{j=1}^{J} p(θ_j | φ) p(φ) dφ. 


Prove that the covariances cov(θ_i, θ_j) are all nonnegative. 


6. Exchangeable models: 


(a) In the divorce rate example of Section 5.2, set up a prior distribution for the values 
y_1, ..., y_8 that allows for one low value (Utah) and one high value (Nevada), with 
independent and identical distributions for the other six values. This prior distribution 
should be exchangeable, because it is not known which of the eight states correspond 
to Utah and Nevada. 


(b) Determine the posterior distribution for y_8 under this model given the observed values 
of y_1, ..., y_7 given in the example. This posterior distribution should probably have 
two or three modes, corresponding to the possibilities that the missing state is Utah, 
Nevada, or one of the other six. 

(c) Now consider the entire set of eight data points, including the value for y_8 given at 
the end of the example. Are these data consistent with the prior distribution you gave 
in part (a) above? In particular, did your prior distribution allow for the possibility 
that the actual data have an outlier (Nevada) at the high end, but no outlier at the 
low end? 


7. Continuous mixture models: 


(a) If y|θ ∼ Poisson(θ), and θ ∼ Gamma(α, β), then the marginal (prior predictive) 
distribution of y is negative binomial with parameters α and β (or p = β/(1 + β)). 
Use the formulas (2.7) and (2.8) to derive the mean and variance of the negative 
binomial. 

(b) In the normal model with unknown location and scale (μ, σ²), the noninformative 
prior density, p(μ, σ²) ∝ 1/σ², results in a normal-inverse-χ² posterior distribution for 
(μ, σ²). Marginally then √n (μ − ȳ)/s has a posterior distribution that is t_{n−1}. Use 
(2.7) and (2.8) to derive the first two moments of the latter distribution, stating the 
appropriate condition on n for existence of both moments. 


8. Discrete mixture models: if p_m(θ), for m = 1, ..., M, are conjugate prior densities for 
the sampling model y|θ, show that the class of finite mixture prior densities given by 


p(θ) = Σ_{m=1}^{M} λ_m p_m(θ) 


is also a conjugate class, where the λ_m’s are nonnegative weights that sum to 1. This 
can provide a useful extension of the natural conjugate prior family to more flexible 
distributional forms. As an example, use the mixture form to create a bimodal prior 
density for a normal mean, that is thought to be near 1, with a standard deviation of 
0.5, but has a small probability of being near −1, with the same standard deviation. If 
the variance of each observation y_1, ..., y_10 is known to be 1, and their observed mean 
is ȳ = −0.25, derive your posterior distribution for the mean, making a sketch of both 
prior and posterior densities. Be careful: the prior and posterior mixture proportions are 
different. 


9. Noninformative hyperprior distributions: consider the hierarchical binomial model in 
Section 5.3. Improper posterior distributions are, in fact, a general problem with hier- 
archical models when a uniform prior distribution is specified for the logarithm of the 
population standard deviation of the exchangeable parameters. In the case of the beta 
population distribution, the prior variance is approximately (α+β)^{−1} (see Appendix A), 
and so a uniform distribution on log(α+β) is approximately uniform on the log standard 
deviation. The resulting unnormalized posterior density (5.8) has an infinite integral 
in the limit as the population standard deviation approaches 0. We encountered the 
problem again in Section 5.4 for the hierarchical normal model. 


(a) Show that, with a uniform prior density on (log(α/β), log(α+β)), the unnormalized 
posterior density has an infinite integral. 

(b) A simple way to avoid the impropriety is to assign a uniform prior distribution to the 
standard deviation parameter itself, rather than its logarithm. For the beta population 
distribution we are considering here, this is achieved approximately by assigning a 
uniform prior distribution to (α+β)^{−1/2}. Show that combining this with an independent 
uniform prior distribution on α/(α+β) yields the prior density (5.10). 


(c) Show that the resulting posterior density (5.8) is proper as long as 0 < y_j < n_j for at 
least one experiment j. 


10. Checking the integrability of the posterior distribution: consider the hierarchical normal 
model in Section 5.4. 


(a) If the hyperprior distribution is p(μ, τ) ∝ τ^{−1} (that is, p(μ, log τ) ∝ 1), show that the 
posterior density is improper. 

(b) If the hyperprior distribution is p(μ, τ) ∝ 1, show that the posterior density is proper 
if J > 2. 


(c) How would you analyze SAT coaching data if J = 2 (that is, data from only two 
schools)? 


11. Nonconjugate hierarchical models: suppose that in the rat tumor example, we wish to 
use a normal population distribution on the log-odds scale: logit(θ_j) ∼ N(μ, τ²), for 
j = 1, ..., J. As in Section 5.3, you will assign a noninformative prior distribution to the 
hyperparameters and perform a full Bayesian analysis. 


(a) Write the joint posterior density, p(θ, μ, τ | y). 
(b) Show that the integral (5.4) has no closed-form expression. 
(c) Why is expression (5.5) no help for this problem? 


In practice, we can solve this problem by normal approximation, importance sampling, 
and Markov chain simulation, as described in Part III. 


12. Conditional posterior means and variances: derive analytic expressions for E(θ_j | τ, y) and 
var(θ_j | τ, y) in the hierarchical normal model (as used in Figures 5.6 and 5.7). (Hint: 
use (2.7) and (2.8), averaging over μ.) 

13. Hierarchical binomial model: Exercise 3.8 described a survey of bicycle traffic in Berkeley, 
California, with data displayed in Table 3.3. For this problem, restrict your attention to 
the first two rows of the table: residential streets labeled as ‘bike routes,’ which we will 
use to illustrate this computational exercise. 


(a) Set up a model for the data in Table 3.3 so that, for j = 1, ..., 10, the observed number 
of bicycles at location j is binomial with unknown probability θ_j and sample size equal 
to the total number of vehicles (bicycles included) in that block. The parameter θ_j 
can be interpreted as the underlying or ‘true’ proportion of traffic at location j that is 
bicycles. (See Exercise 3.8.) Assign a beta population distribution for the parameters 
θ_j and a noninformative hyperprior distribution as in the rat tumor example of Section 
5.3. Write down the joint posterior distribution. 


(b) Compute the marginal posterior density of the hyperparameters and draw simulations 
from the joint posterior distribution of the parameters and hyperparameters, as in 
Section 5.3. 

(c) Compare the posterior distributions of the parameters θ_j to the raw proportions, 
(number of bicycles / total number of vehicles) in location j. How do the inferences 
from the posterior distribution differ from the raw proportions? 

(d) Give a 95% posterior interval for the average underlying proportion of traffic that is 
bicycles. 

(e) A new city block is sampled at random and is a residential street with a bike route. In 
an hour of observation, 100 vehicles of all kinds go by. Give a 95% posterior interval 
for the number of those vehicles that are bicycles. Discuss how much you trust this 
interval in application. 


(f) Was the beta distribution for the θ_j’s reasonable? 


14. Hierarchical Poisson model: consider the dataset in the previous problem, but suppose 
only the total amount of traffic at each location is observed. 


(a) Set up a model in which the total number of vehicles observed at each location j 
follows a Poisson distribution with parameter θ_j, the ‘true’ rate of traffic per hour at 
that location. Assign a gamma population distribution for the parameters θ_j and a 
noninformative hyperprior distribution. Write down the joint posterior distribution. 

(b) Compute the marginal posterior density of the hyperparameters and plot its contours. 
Simulate random draws from the posterior distribution of the hyperparameters and 
make a scatterplot of the simulation draws. 

(c) Is the posterior density integrable? Answer analytically by examining the joint pos- 
terior density at the limits or empirically by examining the plots of the marginal 
posterior density above. 

(d) If the posterior density is not integrable, alter it and repeat the previous two steps. 


(e) Draw samples from the joint posterior distribution of the parameters and hyperpa- 
rameters, by analogy to the method used in the hierarchical binomial model. 


15. Meta-analysis: perform the computations for the meta-analysis data of Table 5.4. 


(a) Plot the posterior density of τ over an appropriate range that includes essentially all 
of the posterior density, analogous to Figure 5.5. 

(b) Produce graphs analogous to Figures 5.6 and 5.7 to display how the posterior means 
and standard deviations of the θ_j’s depend on τ. 

(c) Produce a scatterplot of the crude effect estimates vs. the posterior median effect 
estimates of the 22 studies. Verify that the studies with smallest sample sizes are 
partially pooled the most toward the mean. 

(d) Draw simulations from the posterior distribution of a new treatment effect, θ̃_j. Plot 
a histogram of the simulations. 

(e) Given the simulations just obtained, draw simulated outcomes from replications of 
a hypothetical new experiment with 100 persons in each of the treated and control 
groups. Plot a histogram of the simulations of the crude estimated treatment effect 
(5.23) in the new experiment. 


16. Equivalent data: Suppose we wish to apply the inferences from the meta-analysis example 
in Section 5.6 to data on a new study with equal numbers of people in the control and 
treatment groups. How large would the study have to be so that the prior and data were 
weighted equally in the posterior inference for that study? 


17. Informative prior distributions: Continuing the example from Exercise 2.22, consider a 
(hypothetical) study of a simple training program for basketball free-throw shooting. A 
random sample of 100 college students is recruited into the study. Each student first 
shoots 100 free-throws to establish a baseline success probability. Each student then 
takes 50 practice shots each day for a month. At the end of that time, he or she takes 
100 shots for a final measurement. 

Let θ_i be the improvement in success probability for person i. For simplicity, assume the 
θ_i’s are normally distributed with mean μ and standard deviation σ. 
Give three joint prior distributions for μ, σ: 


(a) A noninformative prior distribution, 
(b) A subjective prior distribution based on your best knowledge, and 
(c) A weakly informative prior distribution. 


Part II: Fundamentals of Bayesian Data Analysis 


For most problems of applied Bayesian statistics, the data analyst must go beyond the 
simple structure of prior distribution, likelihood, and posterior distribution. In Chapter 6, 
we discuss methods of assessing the sensitivity of posterior inferences to model assumptions 
and checking the fit of a probability model to data and substantive information. Model 
checking allows an escape from the tautological aspect of formal approaches to Bayesian 
inference, under which all conclusions are conditional on the truth of the posited model. 
Chapter 7 considers evaluating and comparing models using predictive accuracy, adjusting 
for the parameters being fit to the data. Chapter 8 outlines the role of study design and 
methods of data collection in probability modeling, focusing on how to set up Bayesian 
inference for sample surveys, designed experiments, and observational studies; this chapter 
contains some of the most conceptually distinctive and potentially difficult material in 
the book. Chapter 9 discusses the use of Bayesian inference in applied decision analysis, 
illustrating with examples from social science, medicine, and public health. These four 
chapters explore the creative choices that are required, first to set up a Bayesian model in 
a complex problem, then to perform the model checking and confidence building that is 
typically necessary to make posterior inferences scientifically defensible, and finally to use 
the inferences in decision making. 


Chapter 6 


Model checking 


6.1 The place of model checking in applied Bayesian statistics 


Once we have accomplished the first two steps of a Bayesian analysis—constructing a prob- 
ability model and computing the posterior distribution of all estimands—we should not 
ignore the relatively easy step of assessing the fit of the model to the data and to our 
substantive knowledge. It is difficult to include in a probability distribution all of one’s 
knowledge about a problem, and so it is wise to investigate what aspects of reality are not 
captured by the model. 

Checking the model is crucial to statistical analysis. Bayesian prior-to-posterior infer- 
ences assume the whole structure of a probability model and can yield misleading inferences 
when the model is poor. A good Bayesian analysis, therefore, should include at least some 
check of the adequacy of the fit of the model to the data and the plausibility of the model 
for the purposes for which the model will be used. This is sometimes discussed as a problem 
of sensitivity to the prior distribution, but in practice the likelihood model is typically just 
as suspect; throughout, we use ‘model’ to encompass the sampling distribution, the prior 
distribution, any hierarchical structure, and issues such as which explanatory variables have 
been included in a regression. 


Sensitivity analysis and model improvement 


It is typically the case that more than one reasonable probability model can provide an 
adequate fit to the data in a scientific problem. The basic question of a sensitivity analysis 
is: how much do posterior inferences change when other reasonable probability models 
are used in place of the present model? Other reasonable models may differ substantially 
from the present model in the prior specification, the sampling distribution, or in what 
information is included (for example, predictor variables in a regression). It is possible that 
the present model provides an adequate fit to the data, but that posterior inferences differ 
under plausible alternative models. 

In theory, both model checking and sensitivity analysis can be incorporated into the 
usual prior-to-posterior analysis. Under this perspective, model checking is done by set- 
ting up a comprehensive joint distribution, such that any data that might be observed are 
plausible outcomes under the joint distribution. That is, this joint distribution is a mixture 
of all possible ‘true’ models or realities, incorporating all known substantive information. 
The prior distribution in such a case incorporates prior beliefs about the likelihood of the 
competing realities and about the parameters of the constituent models. The posterior dis- 
tribution of such an exhaustive probability model automatically incorporates all ‘sensitivity 
analysis’ but is still predicated on the truth of some member of the larger class of models. 

In practice, however, setting up such a super-model to include all possibilities and all 
substantive knowledge is both conceptually impossible and computationally infeasible in all 
but the simplest problems. It is thus necessary for us to examine our models in other ways 
to see how they fail to fit reality and how sensitive the resulting posterior distributions are 
to arbitrary specifications. 


Judging model flaws by their practical implications 


We do not like to ask, ‘Is our model true or false?’, since probability models in most 
data analyses will not be perfectly true. Even the coin tosses and die rolls ubiquitous in 
probability theory texts are not truly exchangeable. The more relevant question is, ‘Do the 
model’s deficiencies have a noticeable effect on the substantive inferences?’ 

In the examples of Chapter 5, the beta population distribution for the tumor rates and 
the normal distribution for the eight school effects are both chosen partly for convenience. 
In these examples, making convenient distributional assumptions turns out not to matter, 
in terms of the impact on the inferences of most interest. How to judge when assumptions 
of convenience can be made safely is a central task of Bayesian sensitivity analysis. Failures 
in the model lead to practical problems by creating clearly false inferences about estimands 
of interest. 


6.2 Do the inferences from the model make sense? 


In any applied problem, there will be knowledge that is not included formally in either 
the prior distribution or the likelihood, for reasons of convenience or objectivity. If the 
additional information suggests that posterior inferences of interest are false, then this 
suggests a potential for creating a more accurate probability model for the parameters and 
data collection process. We illustrate with an example of a hierarchical regression model. 


Example. Evaluating election predictions by comparing to substantive po- 
litical knowledge 

Figure 6.1 displays a forecast, made in early October, 1992, of the probability that Bill 
Clinton would win each state in the U.S. presidential election that November. The 
estimates are posterior probabilities based on a hierarchical linear regression model. 
For each state, the height of the shaded part of the box represents the estimated 
probability that Clinton would win the state. Even before the election occurred, the 
forecasts for some of the states looked wrong; for example, from state polls, Clinton 
was known in October to be much weaker in Texas and Florida than shown in the 
map. This does not mean that the forecast is useless, but it is good to know where 
the weak points are. Certainly, after the election, we can do an even better job of 
criticizing the model and understanding its weaknesses. We return to this election 
forecasting example in Section 15.2 as an example of a hierarchical linear model. 


External validation 


More formally, we can check a model by external validation: using the model to make 
predictions about future data, and then collecting those data and comparing them to the predictions. 
Posterior means should be correct on average, 50% intervals should contain the true values 
half the time, and so forth. We used external validation to check the empirical probability 
estimates in the record-linkage example in Section 1.7, and we apply the idea again to check 
a toxicology model in Section 19.2. In the latter example, the external validation (see Figure 
19.10 on page 484) reveals a generally reasonable fit but with some notable discrepancies 
between predictions and external data. Often we need to check the model before obtaining 
new data or waiting for the future to happen. In this chapter and the next, we discuss 
methods which can approximate external validation using the data we already have.

Figure 6.1 Summary of a forecast of the 1992 U.S. presidential election performed one month
before the election. For each state, the proportion of the box that is shaded represents the estimated 
probability of Clinton winning the state; the width of the box is proportional to the number of 
electoral votes for the state. 


Choices in defining the predictive quantities 


A single model can be used to make different predictions. For example, in the SAT example
we could consider a joint prediction for future data from the 8 schools in the study, $p(\tilde{y} \mid y)$,
a joint prediction for 8 new schools, $p(\tilde{y}_i \mid y)$, $i = 9, \ldots, 16$, or any other combination of new
and existing schools. Other scenarios may offer even more choices in defining the
focus of predictions. For example, in analyses of sample surveys and designed experiments,
it often makes sense to consider hypothetical replications of the experiment with a new 
randomization of selection or treatment assignment, by analogy to classical randomization 
tests. 

Sections 6.3 and 6.4 discuss posterior predictive checking, which uses global summaries
to check the joint posterior predictive distribution $p(\tilde{y} \mid y)$. At the end of Section 6.3 we
briefly discuss methods that combine inferences for local quantities to check marginal
predictive distributions $p(\tilde{y}_i \mid y)$, an idea that is related to cross-validation methods considered
in Chapter 7.


6.3 Posterior predictive checking 


If the model fits, then replicated data generated under the model should look similar to 
observed data. To put it another way, the observed data should look plausible under 
the posterior predictive distribution. This is really a self-consistency check: an observed 
discrepancy can be due to model misfit or chance. 

Our basic technique for checking the fit of a model to data is to draw simulated values 
from the joint posterior predictive distribution of replicated data and compare these samples 
to the observed data. Any systematic differences between the simulations and the data 
indicate potential failings of the model. 

We introduce posterior predictive checking with a simple example of an obviously poorly 
fitting model, and then in the rest of this section we lay out the key choices involved in
posterior predictive checking. Sections 6.3 and 6.4 discuss numerical and graphical predictive
checks in more detail. 


Example. Comparing Newcomb’s speed of light measurements to the posterior predictive distribution

Figure 6.2 Twenty replications, $y^{\mathrm{rep}}$, of the speed of light data from the posterior predictive
distribution, $p(y^{\mathrm{rep}} \mid y)$; compare to observed data, $y$, in Figure 3.1. Each histogram displays the result
of drawing 66 independent values $\tilde{y}_i$ from a common normal distribution with mean and variance
$(\mu, \sigma^2)$ drawn from the posterior distribution, $p(\mu, \sigma^2 \mid y)$, under the normal model.

Figure 6.3 Smallest observation of Newcomb’s speed of light data (the vertical line at the left of the
graph), compared to the smallest observations from each of the 20 posterior predictive simulated
datasets displayed in Figure 6.2.

Simon Newcomb’s 66 measurements on the speed of light are presented in Section 3.2.
In the absence of other information, in Section 3.2 we modeled the measurements as
$N(\mu, \sigma^2)$, with a noninformative uniform prior distribution on $(\mu, \log \sigma)$. However, the
lowest of Newcomb’s measurements look like outliers compared to the rest of the data. 
Could the extreme measurements have reasonably come from a normal distribution? 
We address this question by comparing the observed data to what we expect to be 
observed under our posterior distribution. Figure 6.2 displays twenty histograms, 
each of which represents a single draw from the posterior predictive distribution of 
the values in Newcomb’s experiment, obtained by first drawing $(\mu, \sigma^2)$ from their
joint posterior distribution, then drawing 66 values from a normal distribution with 
this mean and variance. All these histograms look different from the histogram of 
actual data in Figure 3.1 on page 67. One way to measure the discrepancy is to 
compare the smallest value in each hypothetical replicated dataset to Newcomb’s 
smallest observation, $-44$. The histogram in Figure 6.3 shows the smallest observation
in each of the 20 hypothetical replications; all are much larger than Newcomb’s smallest 
observation, which is indicated by a vertical line on the graph. The normal model 
clearly does not capture the variation that Newcomb observed. A revised model 
might use an asymmetric contaminated normal distribution or a symmetric long-tailed 
distribution in place of the normal measurement model.

Many other examples of posterior predictive checks appear throughout the book, including
the educational testing example in Section 6.5, linear regression examples in Sections
14.3 and 15.2, and a hierarchical mixture model in Section 22.2.

For many problems, it is useful to examine graphical comparisons of summaries of the 
data to summaries from posterior predictive simulations, as in Figure 6.3. In cases with 
less blatant discrepancies than the outliers in the speed of light data, it is often also useful 
to measure the ‘statistical significance’ of the lack of fit, a notion we formalize here. 
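
As a minimal sketch of the comparison behind Figures 6.2 and 6.3, the following Python code draws replicated datasets under the normal model with the noninformative prior on $(\mu, \log \sigma)$ and records the smallest value in each. The data `y` below are a synthetic stand-in (hypothetical values, not Newcomb's measurements), and the function names `draw_normal_posterior` and `replicate_minima` are illustrative assumptions rather than anything defined in the text.

```python
import numpy as np

def draw_normal_posterior(y, n_draws, rng):
    """Draw (mu, sigma2) from the posterior under a uniform prior on (mu, log sigma)."""
    n = len(y)
    ybar, s2 = np.mean(y), np.var(y, ddof=1)
    # sigma^2 | y follows a scaled inverse-chi^2 with n - 1 degrees of freedom and scale s^2
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=n_draws)
    # mu | sigma^2, y ~ N(ybar, sigma^2 / n)
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    return mu, sigma2

def replicate_minima(y, n_rep, rng):
    """Smallest value in each of n_rep replicated datasets of the same size as y."""
    mu, sigma2 = draw_normal_posterior(y, n_rep, rng)
    y_rep = rng.normal(mu[:, None], np.sqrt(sigma2)[:, None], size=(n_rep, len(y)))
    return y_rep.min(axis=1)

rng = np.random.default_rng(0)
# Synthetic stand-in data (assumption): 64 well-behaved values plus two low outliers.
y = np.concatenate([rng.normal(27, 5, size=64), [-44.0, -2.0]])
minima = replicate_minima(y, n_rep=20, rng=rng)
print("observed minimum:", y.min())
print("replicated minima:", np.round(np.sort(minima), 1))
```

With data of this shape, the replicated minima would be expected to sit well above the observed minimum, which is the kind of discrepancy a plot like Figure 6.3 makes visible.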


Notation for replications 


Let $y$ be the observed data and $\theta$ be the vector of parameters (including all the
hyperparameters if the model is hierarchical). To avoid confusion with the observed data, $y$, we
define $y^{\mathrm{rep}}$ as the replicated data that could have been observed, or, to think predictively, as
the data we would see tomorrow if the experiment that produced $y$ today were replicated
with the same model and the same value of $\theta$ that produced the observed data.

We distinguish between $y^{\mathrm{rep}}$ and $\tilde{y}$, our general notation for predictive outcomes: $\tilde{y}$ is
any future observable value or vector of observable quantities, whereas $y^{\mathrm{rep}}$ is specifically a
replication just like $y$. For example, if the model has explanatory variables, $x$, they will be
identical for $y$ and $y^{\mathrm{rep}}$, but $\tilde{y}$ may have its own explanatory variables, $\tilde{x}$.

We will work with the distribution of $y^{\mathrm{rep}}$ given the current state of knowledge, that is,
with the posterior predictive distribution

$$p(y^{\mathrm{rep}} \mid y) = \int p(y^{\mathrm{rep}} \mid \theta)\, p(\theta \mid y)\, d\theta. \qquad (6.1)$$


Test quantities 


We measure the discrepancy between model and data by defining test quantities, the aspects 
of the data we wish to check. A test quantity, or discrepancy measure, $T(y, \theta)$, is a scalar
summary of parameters and data that is used as a standard when comparing data to 
predictive simulations. Test quantities play the role in Bayesian model checking that test 
statistics play in classical testing. We use the notation T(y) for a test statistic, which is a 
test quantity that depends only on data; in the Bayesian context, we can generalize test 
statistics to allow dependence on the model parameters under their posterior distribution. 
This can be useful in directly summarizing discrepancies between model and data. We 
discuss options for graphical test quantities in Section 6.4. The test quantities in this 
section are usually functions of $y$ or replicated data $y^{\mathrm{rep}}$. At the end of this section we
briefly discuss a different sort of test quantity used for calibration that is a function of
both $y_i$ and $y_i^{\mathrm{rep}}$ (or $\tilde{y}_i$). In Chapter 7 we discuss measures of discrepancy between model
and data, that is, measures of predictive accuracy that are also functions of both $y_i$ and
$y_i^{\mathrm{rep}}$ (or $\tilde{y}_i$).


Tail-area probabilities 


Lack of fit of the data with respect to the posterior predictive distribution can be measured 
by the tail-area probability, or p-value, of the test quantity, and computed using posterior 
simulations of $(\theta, y^{\mathrm{rep}})$. We define the p-value mathematically, first for the familiar classical
test and then in the Bayesian context. 


Classical p-values. The classical p-value for the test statistic T(y) is 


$$p_C = \Pr(T(y^{\mathrm{rep}}) \ge T(y) \mid \theta), \qquad (6.2)$$

where the probability is taken over the distribution of $y^{\mathrm{rep}}$ with $\theta$ fixed. (The distribution of
$y^{\mathrm{rep}}$ given $y$ and $\theta$ is the same as its distribution given $\theta$ alone.) Test statistics are classically
derived in a variety of ways but generally represent a summary measure of discrepancy
between the observed data and what would be expected under a model with a particular
value of $\theta$. This value may be a ‘null’ value, corresponding to a ‘null hypothesis,’ or a point
estimate such as the maximum likelihood value. A point estimate for $\theta$ must be substituted
to compute a p-value in classical statistics.


Posterior predictive p-values. To evaluate the fit of the posterior distribution of a Bayesian 
model, we can compare the observed data to the posterior predictive distribution. In the 
Bayesian approach, test quantities can be functions of the unknown parameters as well as 
data because the test quantity is evaluated over draws from the posterior distribution of the 
unknown parameters. The Bayesian p-value is defined as the probability that the replicated 
data could be more extreme than the observed data, as measured by the test quantity: 


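$$p_B = \Pr(T(y^{\mathrm{rep}}, \theta) \ge T(y, \theta) \mid y),$$

where the probability is taken over the posterior distribution of $\theta$ and the posterior predictive
distribution of $y^{\mathrm{rep}}$ (that is, the joint distribution, $p(\theta, y^{\mathrm{rep}} \mid y)$):

$$p_B = \iint I_{T(y^{\mathrm{rep}}, \theta) \ge T(y, \theta)}\, p(y^{\mathrm{rep}} \mid \theta)\, p(\theta \mid y)\, dy^{\mathrm{rep}}\, d\theta,$$

where $I$ is the indicator function. In this formula, we have used the property of the predictive
distribution that $p(y^{\mathrm{rep}} \mid \theta, y) = p(y^{\mathrm{rep}} \mid \theta)$.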

In practice, we usually compute the posterior predictive distribution using simulation.
If we already have $S$ simulations from the posterior density of $\theta$, we just draw one $y^{\mathrm{rep}}$
from the predictive distribution for each simulated $\theta$; we now have $S$ draws from the joint
posterior distribution, $p(y^{\mathrm{rep}}, \theta \mid y)$. The posterior predictive check is the comparison between
the realized test quantities, $T(y, \theta^s)$, and the predictive test quantities, $T(y^{\mathrm{rep}\,s}, \theta^s)$. The
estimated p-value is just the proportion of these $S$ simulations for which the test quantity
equals or exceeds its realized value; that is, for which $T(y^{\mathrm{rep}\,s}, \theta^s) \ge T(y, \theta^s)$, $s = 1, \ldots, S$.

In contrast to the classical approach, Bayesian model checking does not require special 
methods to handle ‘nuisance parameters’; by using posterior simulations, we implicitly 
average over all the parameters in the model. 


Example. Speed of light (continued) 

In Figure 6.3, we demonstrated the poor fit of the normal model to the speed of light 
data using $\min(y_i)$ as the test statistic. We continue this example using other test
quantities to illustrate how the fit of a model depends on the aspects of the data and 
parameters being monitored. Figure 6.4a shows the observed sample variance and the 
distribution of 200 simulated variances from the posterior predictive distribution. The 
sample variance does not make a good test statistic because it is a sufficient statistic of 
the model and thus, in the absence of an informative prior distribution, the posterior 
distribution will automatically be centered near the observed value. We are not at all 
surprised to find an estimated p-value close to 1/2.

The model check based on $\min(y_i)$ earlier in the chapter suggests that the normal
model is inadequate. To illustrate that a model can be inadequate for some purposes
but adequate for others, we assess whether the model is adequate except for
the extreme tails by considering a model check based on a test quantity sensitive to
asymmetry in the center of the distribution,

$$T(y, \theta) = |y_{(61)} - \theta| - |y_{(6)} - \theta|.$$


Figure 6.4 Realized vs. posterior predictive distributions for two more test quantities in the speed
of light example: (a) Sample variance (vertical line at 115.5), compared to 200 simulations from
the posterior predictive distribution of the sample variance. (b) Scatterplot showing prior and
posterior simulations of a test quantity: $T(y, \theta) = |y_{(61)} - \theta| - |y_{(6)} - \theta|$ (horizontal axis) vs.
$T(y^{\mathrm{rep}}, \theta) = |y^{\mathrm{rep}}_{(61)} - \theta| - |y^{\mathrm{rep}}_{(6)} - \theta|$ (vertical axis) based on 200 simulations from the posterior
distribution of $(\theta, y^{\mathrm{rep}})$. The p-value is computed as the proportion of points in the upper-left half
of the scatterplot.

The 61st and 6th order statistics are chosen to represent approximately the 90% and
10% points of the distribution. The test quantity should be scattered about zero
for a symmetric distribution. The scatterplot in Figure 6.4b shows the test quantity
for the observed data and the test quantity evaluated for the simulated data for 200
simulations from the posterior distribution of $(\theta, \sigma^2)$. The estimated p-value is 0.26,
implying that any observed asymmetry in the middle of the distribution can easily be
explained by sampling variation.
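
The simulation recipe for a posterior predictive p-value (one replicated dataset per posterior draw, then the proportion of draws in which the predictive test quantity equals or exceeds the realized one) can be sketched in Python as follows. This is an illustration under stated assumptions: the data are synthetic stand-ins rather than Newcomb's measurements, the helper names `ppc_pvalue` and `asymmetry` are hypothetical, and the parameter draws are stored as `(mean, variance)` pairs so that the mean plays the role of $\theta$ in the asymmetry quantity.

```python
import numpy as np

def ppc_pvalue(y, theta_draws, simulate, T, rng):
    """Estimate Pr(T(y_rep, theta) >= T(y, theta) | y) from posterior draws.

    theta_draws: sequence of posterior parameter draws
    simulate(theta, rng): returns one replicated dataset given theta
    T(data, theta): scalar test quantity
    """
    realized = np.array([T(y, th) for th in theta_draws])
    predictive = np.array([T(simulate(th, rng), th) for th in theta_draws])
    return np.mean(predictive >= realized), realized, predictive

def asymmetry(data, theta):
    """|y_(61) - mean| - |y_(6) - mean| for a dataset of 66 values."""
    ys = np.sort(data)
    mean = theta[0]
    return abs(ys[60] - mean) - abs(ys[5] - mean)

rng = np.random.default_rng(1)
y = rng.normal(26, 11, size=66)            # synthetic stand-in data (assumption)
n, ybar, s2 = len(y), y.mean(), y.var(ddof=1)
S = 200
# Posterior draws under the normal model with a uniform prior on (mean, log sd).
sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=S)
mu = rng.normal(ybar, np.sqrt(sigma2 / n))
draws = list(zip(mu, sigma2))
simulate = lambda th, rng: rng.normal(th[0], np.sqrt(th[1]), size=n)

p_asym, realized, predictive = ppc_pvalue(y, draws, simulate, asymmetry, rng)
p_var, _, _ = ppc_pvalue(y, draws, simulate, lambda d, th: np.var(d, ddof=1), rng)
print("asymmetry p-value:", p_asym)
print("sample-variance p-value:", p_var)
```

The pairs `(realized, predictive)` are what a scatterplot such as Figure 6.4b would display; the sample-variance check is included because, as noted above, it should land near 1/2 under this model regardless of the data.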


Choosing test quantities 


The procedure for carrying out a posterior predictive model check requires specifying a test
quantity, $T(y)$ or $T(y, \theta)$, and an appropriate predictive distribution for the replications
$y^{\mathrm{rep}}$ (which involves deciding which if any aspects of the data to condition on, as discussed
at the end of Section 6.3). If $T(y)$ does not appear to be consistent with the set of values
$T(y^{\mathrm{rep}\,1}), \ldots, T(y^{\mathrm{rep}\,S})$, then the model is making predictions that do not fit the data. The
discrepancy between $T(y)$ and the distribution of $T(y^{\mathrm{rep}})$ can be summarized by a p-value
(as discussed in Section 6.3) but we prefer to look at the magnitude of the discrepancy as
well as its p-value.


Example. Checking the assumption of independence in binomial trials 

Consider a sequence of binary outcomes, $y_1, \ldots, y_n$, modeled as a specified number of
independent trials with a common probability of success, $\theta$, that is given a uniform
prior distribution. As discussed in Chapter 2, the posterior density under the model
is $p(\theta \mid y) \propto \theta^{\sum y}(1 - \theta)^{n - \sum y}$, which depends on the data only through the sufficient
statistic, $\sum_{i=1}^{n} y_i$. Now suppose the observed data are, in order, 1, 1, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0. The observed autocorrelation is evidence that the
model is flawed. To quantify the evidence, we can perform a posterior predictive test
using the test quantity T = number of switches between 0 and 1 in the sequence. The
observed value is $T(y) = 3$, and we can determine the posterior predictive distribution
of $T(y^{\mathrm{rep}})$ by simulation. To simulate $y^{\mathrm{rep}}$ under the model, we first draw $\theta$ from its
Beta(8, 14) posterior distribution, then draw $y^{\mathrm{rep}} = (y_1^{\mathrm{rep}}, \ldots, y_{20}^{\mathrm{rep}})$ as independent
Bernoulli variables with probability $\theta$. Figure 6.5 displays a histogram of the values
of $T(y^{\mathrm{rep}\,s})$ for simulation draws $s = 1, \ldots, 10000$, with the observed value, $T(y) = 3$,
shown by a vertical line. The observed number of switches is about one-third as many
as would be expected from the model under the posterior predictive distribution, and
the discrepancy cannot easily be explained by chance, as indicated by the computed
p-value of about 0.98. To convert to a p-value near zero, we can change the sign of the
test statistic, which amounts to computing $\Pr(T(y^{\mathrm{rep}}, \theta) \le T(y, \theta) \mid y)$, which is 0.028
in this case. The p-values measured from the two ends have a sum that is greater than
1 because of the discreteness of the distribution of $T(y^{\mathrm{rep}})$.

Figure 6.5 Observed number of switches (vertical line at $T(y) = 3$), compared to 10,000 simulations
from the posterior predictive distribution of the number of switches, $T(y^{\mathrm{rep}})$.
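
The check in this example is small enough to reproduce directly. The sketch below uses the 20 observed outcomes listed above, the Beta(8, 14) posterior, and 10,000 simulation draws; the variable and function names are illustrative, and the exact simulated proportions will vary with the random seed.

```python
import numpy as np

# Observed binary sequence from the example (7 successes in 20 trials).
y = np.array([1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])

def n_switches(seq):
    """Number of changes between 0 and 1 along the last axis."""
    return np.sum(np.diff(seq, axis=-1) != 0, axis=-1)

rng = np.random.default_rng(2)
S = 10_000
a, b = 1 + y.sum(), 1 + len(y) - y.sum()     # uniform prior gives Beta(8, 14)
theta = rng.beta(a, b, size=S)
# One replicated sequence of independent Bernoulli(theta) trials per posterior draw.
y_rep = (rng.random((S, len(y))) < theta[:, None]).astype(int)

T_obs = n_switches(y)
T_rep = n_switches(y_rep)
p_upper = np.mean(T_rep >= T_obs)            # Pr(T(y_rep) >= T(y) | y)
p_lower = np.mean(T_rep <= T_obs)            # Pr(T(y_rep) <= T(y) | y)
print("T(y) =", T_obs, " mean T(y_rep) =", T_rep.mean())
print("upper tail:", p_upper, " lower tail:", p_lower)
```

Because replications with exactly three switches are counted in both tails, the two estimated tail probabilities add up to slightly more than 1, as described in the text.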


For many problems, a function of data and parameters can directly address a particular
aspect of a model in a way that would be difficult or awkward using a function of data
alone. If the test quantity depends on $\theta$ as well as $y$, then the test quantity $T(y, \theta)$ as well
as its replication $T(y^{\mathrm{rep}}, \theta)$ are unknowns and are represented by $S$ simulations, and the
comparison can be displayed either as a scatterplot of the values $T(y, \theta^s)$ vs. $T(y^{\mathrm{rep}\,s}, \theta^s)$
or a histogram of the differences, $T(y, \theta^s) - T(y^{\mathrm{rep}\,s}, \theta^s)$. Under the model, the scatterplot
should be symmetric about the 45° line and the histogram should include 0.

Because a probability model can fail to reflect the process that generated the data in any 
number of ways, posterior predictive p-values can be computed for a variety of test quantities 
in order to evaluate more than one possible model failure. Ideally, the test quantities T 
will be chosen to reflect aspects of the model that are relevant to the scientific purposes 
to which the inference will be applied. Test quantities are commonly chosen to measure a 
feature of the data not directly addressed by the probability model; for example, ranks of 
the sample, or correlation of residuals with some possible explanatory variable. 


Example. Checking the fit of hierarchical regression models for adolescent 
smoking 

We illustrate with a model fitted to a longitudinal dataset of about 2000 Australian 
adolescents whose smoking patterns were recorded every six months (via questionnaire)
for a period of three years. Interest lay in the extent to which smoking behavior
could be predicted based on parental smoking and other background variables, and 
the extent to which boys and girls picked up the habit of smoking during their teenage 
years. Figure 6.6 illustrates the overall rate of smoking among survey participants, 
who had an average age of 14.9 years at the beginning of the study. 

We fit two models to these data. Our first model is a hierarchical logistic regression, 
in which the probability of smoking depends on sex, parental smoking, the wave of the 
study, and an individual parameter for the person. For person $j$ at wave $t$, we model
the probability of smoking as

$$\Pr(y_{jt} = 1) = \mathrm{logit}^{-1}(\beta_0 + \beta_1 X_{j1} + \beta_2 X_{j2} + \beta_3 (1 - X_{j2})\, t + \beta_4 X_{j2}\, t + \alpha_j), \qquad (6.3)$$

where $X_{j1}$ is an indicator for parental smoking and $X_{j2}$ is an indicator for females,
so that $\beta_3$ and $\beta_4$ represent the time trends for males and females, respectively. The
individual effects $\alpha_j$ are assigned a $N(0, \tau^2)$ distribution, with a noninformative uniform
prior distribution on $\beta, \tau$. (See Chapter 22 for more on hierarchical generalized
linear models.)

Figure 6.6 Prevalence of regular (daily) smoking among participants responding at each wave in
the study of Australian adolescents (who were on average 15 years old at wave 1).

                                      Model 1                    Model 2
                               95% int. for               95% int. for
Test variable          T(y)    T(y^rep)       p-value     T(y^rep)       p-value
% never-smokers        77.3    [75.5, 78.2]   0.27        [74.8, 79.9]   0.53
% always-smokers        5.1    [5.0, 6.5]     0.95        [3.8, 6.3]     0.44
% incident smokers      8.4    [5.3, 7.9]     0.005       [4.9, 7.8]     0.004

Table 6.1 Summary of posterior predictive checks for three test statistics for two models fit to the
adolescent smoking data: (1) hierarchical logistic regression, and (2) hierarchical logistic regression
with a mixture component for never-smokers. The second model better fits the percentages of never-
and always-smokers, but still has a problem with the percentage of ‘incident smokers,’ who are
defined as persons who report incidents of non-smoking followed by incidents of smoking.

The second model is an expansion of the first, in which each person $j$ has an unobserved
‘susceptibility’ status $S_j$ that equals 1 if the person might possibly smoke or 0 if he
or she is ‘immune’ from smoking (that is, has no chance of becoming a smoker).
This model is an oversimplification but captures the separation in the data between
adolescents who often or occasionally smoke and those who never smoke at all. In this
mixture model, the smoking status $y_{jt}$ is automatically 0 at all times for nonsusceptible
persons. For those persons with $S_j = 1$, we use the model (6.3), understanding
that these probabilities now refer to the probability of smoking, conditional on being
susceptible. The model is completed with a logistic regression for susceptibility status
given the individual-level predictors: $\Pr(S_j = 1) = \mathrm{logit}^{-1}(\gamma_0 + \gamma_1 X_{j1} + \gamma_2 X_{j2})$, and
a uniform prior distribution on these coefficients $\gamma$.

Table 6.1 shows the results for posterior predictive checks of the two fitted models 
using three different test statistics T(y): 

• The percentage of adolescents in the sample who never smoked.

• The percentage in the sample who smoked during all waves.

• The percentage of ‘incident smokers’: adolescents who began the study as non-smokers,
switched to smoking during the study period, and did not switch back.
From the first column of Table 6.1, we see that 77% of the sample never smoked, 5% 
always smoked, and 8% were incident smokers. The table then displays the posterior 
predictive distribution of each test statistic under each of the two fitted models. Both 
models accurately capture the percentage of never-smokers, but the second model 
better fits the percentage of always-smokers. It makes sense that the second model
should fit this aspect of the data better, since its mixture form separates smokers from
non-smokers. Finally, both models underpredict the proportion of incident smokers, 
which suggests that they are not completely fitting the variation of smoking behavior 
within individuals. 


Posterior predictive checking is a useful direct way of assessing the fit of the model to 
these various aspects of the data. Our goal here is not to compare or choose among the 
models (a topic we discuss in Section 7.3) but rather to explore the ways in which either or 
both models might be lacking. 

Numerical test quantities can also be constructed from patterns noticed visually (as in 
the test statistics chosen for the speed-of-light example in Section 6.3). This can be useful 
to quantify a pattern of potential interest, or to summarize a model check that will be 
performed repeatedly (for example, in checking the fit of a model that is applied to several 
different datasets). 


Multiple comparisons 


One might worry about interpreting the significance levels of multiple tests or of tests 
chosen by inspection of the data. For example, we looked at three different test variables in 
checking the adolescent smoking models, so perhaps it is less surprising than it might seem 
at first that the worst-fitting test statistic had a p-value of 0.005. A ‘multiple comparisons’ 
adjustment would calculate the probability that the most extreme p-value would be as low 
as 0.005, which would perhaps yield an adjusted p-value somewhere near 0.015. 
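
The arithmetic behind the figure of roughly 0.015 can be sketched as follows. The calculation assumes, purely for illustration, that the three checks are independent and that each p-value is uniform under the model; neither assumption is stated in the text, and posterior predictive p-values are in general not uniform.

```python
p_min = 0.005   # smallest p-value among the test variables in Table 6.1 (Model 1)
k = 3           # number of test variables examined

# Assumption: k independent tests, each with a uniform p-value under the model.
# Probability that the smallest of the k p-values is at most p_min:
p_adjusted = 1 - (1 - p_min) ** k
p_bonferroni = min(1.0, k * p_min)   # simple upper bound that needs no independence

print(round(p_adjusted, 4), round(p_bonferroni, 4))
```

Both versions of the adjustment give a value close to 0.015 here.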

We do not make this adjustment, because we use predictive checks to see how particular 
aspects of the data would be expected to appear in replications. If we examine several test 
variables, we would not be surprised for some of them not to be fitted by the model—but 
if we are planning to apply the model, we might be interested in those aspects of the data 
that do not appear typical. We are not concerned with ‘Type I error’ rate—that is, the 
probability of rejecting a hypothesis conditional on it being true—because we use the checks 
not to accept or reject a model but rather to understand the limits of its applicability in 
realistic replications. In the setting where we are interested in making several comparisons 
at once, we prefer to directly make inferences on the comparisons using a multilevel model; 
see the discussion on page 96. 


Interpreting posterior predictive p-values 


A model is suspect if a discrepancy is of practical importance and its observed value has a 
tail-area probability near 0 or 1, indicating that the observed pattern would be unlikely to be 
seen in replications of the data if the model were true. An extreme p-value implies that the 
model cannot be expected to capture this aspect of the data. A p-value is a posterior probability
and can therefore be interpreted directly, although not as Pr(model is true | data).
Major failures of the model, typically corresponding to extreme tail-area probabilities (less 
than 0.01 or more than 0.99), can be addressed by expanding the model appropriately. 
Lesser failures might also suggest model improvements or might be ignored in the short 
term if the failure appears not to affect the main inferences. In some cases, even extreme 
p-values may be ignored if the misfit of the model is substantively small compared to variation
within the model. We typically evaluate a model with respect to several test quantities,
and we should be sensitive to the implications of this practice. 

If a p-value is close to 0 or 1, it is not so important exactly how extreme it is. A p-value 
of 0.00001 is virtually no stronger, in practice, than 0.001; in either case, the aspect of the 
data measured by the test quantity is inconsistent with the model. A slight improvement in 
the model (or correction of a data coding error!) could bring either p-value to a reasonable
range (between 0.05 and 0.95, say). The p-value measures ‘statistical significance,’ not
‘practical significance.’ The latter is determined by how different the observed data are 
from the reference distribution on a scale of substantive interest and depends on the goal of 
the study; an example in which a discrepancy is statistically but not practically significant 
appears at the end of Section 14.3. 

The relevant goal is not to answer the question, ‘Do the data come from the assumed 
model?’ (to which the answer is almost always no), but to quantify the discrepancies between 
data and model, and assess whether they could have arisen by chance, under the model’s 
own assumptions. 


Limitations of posterior tests 


Finding an extreme p-value and thus ‘rejecting’ a model is never the end of an analysis; 
the departures of the test quantity in question from its posterior predictive distribution 
will often suggest improvements of the model or places to check the data, as in the speed 
of light example. Moreover, even when the current model seems appropriate for drawing 
inferences (in that no unusual deviations between the model and the data are found), 
the next scientific step will often be a more rigorous experiment incorporating additional 
factors, thereby providing better data. For instance, in the educational testing example of
Section 5.5, the data do not allow rejection of the model that all the $\theta_j$’s are equal, but
that assumption is clearly unrealistic, hence we do not restrict $\tau$ to be zero.

Finally, the discrepancies found in predictive checks should be considered in their applied 
context. A demonstrably wrong model can still work for some purposes, as we illustrate 
with a regression example in Section 14.3. 


P-values and u-values 


Bayesian predictive checking generalizes classical hypothesis testing by averaging over the 
posterior distribution of the unknown parameter vector $\theta$ rather than fixing it at some
estimate $\hat{\theta}$. Bayesian tests do not rely on the construction of pivotal quantities (that is,
functions of data and parameters whose distributions are independent of the parameters of 
the model) or on asymptotic results, and are therefore applicable in general settings. This 
is not to suggest that the tests are automatic; as with classical testing, the choice of test 
quantity and appropriate predictive distribution requires careful consideration of the type 
of inferences required for the problem being considered. 

In the special case that the parameters $\theta$ are known (or estimated to a very high precision)
or in which the test statistic $T(y)$ is ancillary (that is, if it depends only on observed
data and if its distribution is independent of the parameters of the model) with a continuous
distribution, the posterior predictive p-value $\Pr(T(y^{\mathrm{rep}}) \ge T(y) \mid y)$ has a distribution that is
uniform if the model is true. Under these conditions, p-values less than 0.1 occur 10% of
the time, p-values less than 0.05 occur 5% of the time, and so forth.

More generally, when posterior uncertainty in $\theta$ propagates to the distribution of $T(y \mid \theta)$,
the distribution of the p-value, if the model is true, is more concentrated near the middle of 
the range: the p-value is more likely to be near 0.5 than near 0 or 1. (To be more precise, 
the sampling distribution of the p-value has been shown to be ‘stochastically less variable’ 
than uniform.) 
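
A small simulation can illustrate the contrast between the uniform case and the more concentrated case. The setup below is an assumption chosen because the p-values have closed forms: $n = 2$ observations $y_i \sim N(\theta, 1)$ with known variance, a flat prior on $\theta$, and test statistic $T(y) = y_1$. With $\theta$ known, the p-value is $1 - \Phi(y_1 - \theta)$; with $\theta$ unknown, $y_1^{\mathrm{rep}} \mid y \sim N(\bar{y}, 1 + 1/n)$, so the posterior predictive p-value is $1 - \Phi((y_1 - \bar{y})/\sqrt{1 + 1/n})$.

```python
import math
import numpy as np

# Standard normal CDF, vectorized over arrays via math.erf.
Phi = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

rng = np.random.default_rng(3)
n, n_datasets, theta_true = 2, 20_000, 0.0
y = rng.normal(theta_true, 1.0, size=(n_datasets, n))   # datasets from the true model
ybar = y.mean(axis=1)

# theta known: p-value for T(y) = y_1, uniform under the model.
p_known = 1 - Phi(y[:, 0] - theta_true)
# theta unknown, flat prior: y_1_rep | y ~ N(ybar, 1 + 1/n).
p_post = 1 - Phi((y[:, 0] - ybar) / math.sqrt(1 + 1 / n))

for name, p in [("theta known", p_known), ("posterior predictive", p_post)]:
    print(name, "| Pr(p < 0.05) approx", np.mean(p < 0.05), "| variance approx", p.var())
```

In this setup the posterior predictive p-values fall below 0.05 far less often than 5% of the time and have smaller variance than a uniform distribution, which is the 'stochastically less variable' behavior described above.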

To clarify, we define a u-value as any function of the data y that has a U(0,1) sampling 
distribution. A u-value can be averaged over the distribution of $\theta$ to give it a Bayesian
flavor, but it is fundamentally not Bayesian, in that it cannot necessarily be interpreted as
a posterior probability. In contrast, the posterior predictive p-value is such a probability
statement, conditional on the model and data, about what might be expected in future
replications.

The p-value is to the u-value as the posterior interval is to the confidence interval. Just
as posterior intervals are not, in general, classical confidence intervals (in the sense of having
the stated probability coverage conditional on any value of $\theta$), Bayesian p-values are not
generally u-values.

This property has led some to characterize posterior predictive checks as conservative or 
uncalibrated. We do not think such labeling is helpful; rather, we interpret p-values directly 
as probabilities. The sample space for a posterior predictive check (the set of all possible
events whose probabilities sum to 1) comes from the posterior distribution of $y^{\mathrm{rep}}$. If a
posterior predictive p-value is 0.4, say, that means that, if we believe the model, we think
there is a 40% chance that tomorrow’s value of $T(y^{\mathrm{rep}})$ will exceed today’s $T(y)$. If we were
able to observe such replications in many settings, and if our models were actually true, we 
could collect them and check that, indeed, this happens 40% of the time when the p-value 
is 0.4, that it happens 30% of the time when the p-value is 0.3, and so forth. These p-values 
are as calibrated as any other model-based probability, for example a statement such as, 
‘From a roll of this particular pair of loaded dice, the probability of getting double-sixes is 
0.11,’ or, ‘There is a 50% probability that Barack Obama won more than 52% of the white 
vote in Michigan in the 2008 election.’ 


Model checking and the likelihood principle 


In Bayesian inference, the data enter the posterior distribution only through the likelihood 
function (that is, those aspects of $p(y \mid \theta)$ that depend on the unknown parameters $\theta$); thus,
it is sometimes stated as a principle that inferences should depend on the likelihood and no 
other aspects of the data. 

For a simple example, consider an experiment in which a random sample of 55 students 
is tested to see if their average score on a test exceeds a prechosen passing level of 80 points. 
Further assume the test scores are normally distributed and that some prior distribution 
has been set for $\mu$ and $\sigma$, the mean and standard deviation of the scores in the population
from which the students were drawn. Imagine four possible ways of collecting the data: 
(a) simply take measurements on a random sample of 55 students; (b) randomly sample 
students in sequence and after each student cease collecting data with probability 0.02; (c) 
randomly sample students for a fixed amount of time; or (d) continue to randomly sample 
and measure individual students until the sample mean is significantly different from 80 
using the classical t-test. In designs (c) and (d), the number of measurements is a random 
variable whose distribution depends on unknown parameters. 

For the particular data at hand, these four very different measurement protocols correspond 
to different probability models for the data but identical likelihood functions, and 
thus Bayesian inference about $\mu$ and $\sigma$ does not depend on how the data were collected, 
provided the model is assumed to be true. But once we want to check the model, we need to simulate 
replicated data, and then the sampling rule is relevant. For any fixed dataset y, the 
posterior inference $p(\mu, \sigma \mid y)$ is the same for all these sampling models, but the distribution 
of replicated data, $p(y^{\mathrm{rep}} \mid \mu, \sigma)$, changes. Thus it is possible for aspects of the data to fit well 
under one data-collection model but not another, even if the likelihoods are the same. 
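
To make the role of the sampling rule concrete, the following Python sketch simulates replicated datasets under designs (a) and (b) for a single draw of the parameters. The values of `mu` and `sigma` are hypothetical placeholders standing in for one posterior draw, and the function names are ours; the test quantity is simply the number of measurements.

```python
import random

def replicate_fixed_n(mu, sigma, n, rng):
    """Design (a): a replication always contains exactly n scores."""
    return [rng.gauss(mu, sigma) for _ in range(n)]

def replicate_random_stop(mu, sigma, stop_prob, rng):
    """Design (b): after each student, stop with probability stop_prob."""
    scores = []
    while True:
        scores.append(rng.gauss(mu, sigma))
        if rng.random() < stop_prob:
            return scores

rng = random.Random(1)
mu, sigma = 82.0, 10.0  # hypothetical posterior draw, not taken from the text

# Test quantity T(y) = number of measurements in the dataset.
n_fixed = [len(replicate_fixed_n(mu, sigma, 55, rng)) for _ in range(1000)]
n_stop = [len(replicate_random_stop(mu, sigma, 0.02, rng)) for _ in range(1000)]

print(min(n_fixed), max(n_fixed))  # constant under design (a)
print(min(n_stop), max(n_stop))    # spread out under design (b)
```

An observed sample size of 55 is exactly typical of the replications under design (a) but is only one of many possible values under design (b), even though the likelihood for $\mu$ and $\sigma$ is the same in both cases.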


Marginal predictive checks 


So far in this section the focus has been on replicated data from the joint posterior predictive 
distribution. An alternative approach is to compute the probability distribution for each 
marginal prediction $p(\tilde y_i \mid y)$ separately and then compare these separate distributions to 
data in order to find outliers or check overall calibration. 

The tail-area probability can be computed for each marginal posterior predictive distribution, 

$$p_i = \Pr(T(y_i^{\mathrm{rep}}) \le T(y_i) \mid y).$$

If $y_i$ is scalar and continuous, a natural test quantity is $T(y_i) = y_i$, with tail-area probability, 

$$p_i = \Pr(y_i^{\mathrm{rep}} \le y_i \mid y).$$

For ordered discrete data we can compute a ‘mid’ p-value, 

$$p_i = \Pr(y_i^{\mathrm{rep}} < y_i \mid y) + \tfrac{1}{2} \Pr(y_i^{\mathrm{rep}} = y_i \mid y).$$
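
In practice these probabilities are estimated from posterior predictive simulations. The sketch below is a minimal illustration, assuming we already hold a list of simulated values of $y_i^{\mathrm{rep}}$ for one data point; the function names and the example draws are ours and are not from any dataset in the book.

```python
def tail_p_value(y_rep_draws, y_obs):
    """Estimate the share of replications at or below the observed value."""
    return sum(1 for d in y_rep_draws if d <= y_obs) / len(y_rep_draws)

def mid_p_value(y_rep_draws, y_obs):
    """Estimate Pr(y_rep < y_obs | y) + 0.5 * Pr(y_rep = y_obs | y)."""
    below = sum(1 for d in y_rep_draws if d < y_obs)
    ties = sum(1 for d in y_rep_draws if d == y_obs)
    return (below + 0.5 * ties) / len(y_rep_draws)

# Hypothetical draws of y_i^rep for one discrete observation (illustration only).
draws = [0, 1, 1, 2, 2, 2, 3, 3, 4, 5]
print(tail_p_value(draws, 2))  # ties counted in full
print(mid_p_value(draws, 2))   # ties counted at half weight
```

The two estimates coincide when no replication ties with the observed value, which is the typical situation for continuous data.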


If we combine the checks from single data points, we will in general see different behavior 
than from the joint checks described in the previous section. Consider the educational 
testing example from Section 5.5: 


• Marginal prediction for each of the existing schools, using $p(\tilde y_i \mid y)$, $i = 1, \ldots, 8$. If the 
population prior is noninformative or weakly informative, the center of the posterior 
predictive distribution will be close to $y_i$, and the separate p-values $p_i$ will tend to 
concentrate near 0.5. In the extreme case of no pooling, the separate p-values will be 
exactly 0.5. 


• Marginal prediction for new schools $p(\tilde y_i \mid y)$, $i = 9, \ldots, 16$, comparing replications to the 
observed $y_i$, $i = 1, \ldots, 8$. Now the effect of a single $y_i$ is smaller, working through the 
population distribution, and the $p_i$’s have distributions that are closer to $U(0, 1)$. 

A related approach is to replace predictive distributions with cross-validation predictive 
distributions, for each data point comparing to the inference given all the other data: 

$$p_i = \Pr(y_i^{\mathrm{rep}} \le y_i \mid y_{-i}),$$

where $y_{-i}$ contains all other data except $y_i$. For continuous data, cross-validation predictive 
p-values have a uniform distribution if the model is calibrated. On the downside, cross-validation 
generally requires additional computation. In some settings, posterior predictive 
checking using the marginal predictions for new individuals with exactly the same predictors 
$x_i$ is called mixed predictive checking and can bridge the gap between cross-validation and 
full Bayesian predictive checking. We return to cross-validation in the next chapter. 
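
For a model simple enough to have closed-form predictive distributions, the cross-validation p-values need no simulation. The sketch below assumes a deliberately simple model that is not one of the examples in this chapter: $y_i \sim \mathrm{N}(\theta, \sigma^2)$ with $\sigma$ known and a uniform prior on $\theta$, so that $y_i^{\mathrm{rep}} \mid y_{-i} \sim \mathrm{N}(\bar y_{-i}, \sigma^2 (1 + 1/(n-1)))$. The data values are made up.

```python
from statistics import NormalDist, mean

def loo_p_values(y, sigma):
    """Cross-validation predictive p-values under a normal model, sigma known.

    With a uniform prior on theta, the predictive distribution for y_i given
    the other n - 1 points is normal with mean equal to their average and
    variance sigma^2 * (1 + 1 / (n - 1)).
    """
    n = len(y)
    scale = sigma * (1 + 1 / (n - 1)) ** 0.5
    p = []
    for i in range(n):
        rest = y[:i] + y[i + 1:]
        p.append(NormalDist(mean(rest), scale).cdf(y[i]))
    return p

y = [9.1, 10.4, 9.8, 10.9, 10.1, 14.0]  # made-up data with one large value
for yi, pi in zip(y, loo_p_values(y, sigma=1.0)):
    print(yi, round(pi, 3))
```

For models without closed forms, each $p_i$ would in general require refitting the model without $y_i$ or an approximation to that refit, which is the additional computation mentioned above.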

If the marginal posterior p-values concentrate near 0 and 1, the data are overdispersed 
compared to the model, and if the p-values concentrate near 0.5, the data are underdispersed 
compared to the model. It may also be helpful to look at individual observations related 
to marginal posterior p-values close to 0 or 1. An alternative measure is the conditional 
predictive ordinate, 

$$\mathrm{CPO}_i = p(y_i \mid y_{-i}),$$
which gives low values for unlikely observations given the current model. Examining unlikely 
observations could give insight into how to improve the model. In Chapter 17 we discuss 
how to make model inference more robust if the data have surprising ‘outliers.’ 
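
As a small illustration of the conditional predictive ordinate, the sketch below evaluates $p(y_i \mid y_{-i})$ in closed form for an assumed toy model: normal data with known $\sigma$ and a uniform prior on the mean, for which the leave-one-out predictive density is normal with mean $\bar y_{-i}$ and variance $\sigma^2 (1 + 1/(n-1))$. The model and the data are our own choices for illustration.

```python
from statistics import NormalDist, mean

def cpo(y, sigma):
    """CPO_i = p(y_i | y_{-i}) for normal data with known sigma, flat prior on the mean."""
    n = len(y)
    scale = sigma * (1 + 1 / (n - 1)) ** 0.5
    return [NormalDist(mean(y[:i] + y[i + 1:]), scale).pdf(y[i]) for i in range(n)]

y = [9.1, 10.4, 9.8, 10.9, 10.1, 14.0]  # made-up data with one large value
values = cpo(y, sigma=1.0)
least_likely = min(range(len(y)), key=lambda i: values[i])
print("lowest CPO at index", least_likely, "with y =", y[least_likely])
```

The observation with the lowest ordinate is the natural first candidate to examine when looking for ways to improve the model.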


6.4 Graphical posterior predictive checks 


The basic idea of graphical model checking is to display the data alongside simulated data 
from the fitted model, and to look for systematic discrepancies between real and simulated 
data. This section gives examples of three kinds of graphical display: 

• Direct display of all the data (as in the comparison of the speed-of-light data in Figure 
3.1 to the 20 replications in Figure 6.2). 

• Display of data summaries or parameter inferences. This can be useful in settings where 
the dataset is large and we wish to focus on the fit of a particular aspect of the model. 

• Graphs of residuals or other measures of discrepancy between model and data. 

Figure 6.7 Left column displays observed data y (a 15 × 23 array of binary responses from each 
of 6 persons); right columns display seven replicated datasets $y^{\mathrm{rep}}$ from a fitted logistic regression 
model. A misfit of model to data is apparent: the data show strong row and column patterns for 
individual persons (for example, the nearly white row near the middle of the last person’s data) that 
do not appear in the replicates. (To make such patterns clearer, the indexes of the observed and 
each replicated dataset have been arranged in increasing order of average response.) 


Direct data display 


Figure 6.7 shows another example of model checking by displaying all the data. The left 
column of the figure displays a three-way array of binary data—for each of 6 persons, a 
possible ‘yes’ or ‘no’ to each of 15 possible reactions (displayed as rows) to 23 situations 
(columns)—from an experiment in psychology. The three-way array is displayed as 6 slices, 
one for each person. Before displaying, the reactions, situations, and persons have been 
ordered in increasing average response. We can thus think of the test statistic T(y) as 
being this graphical display, complete with the ordering applied to the data y. 

The right columns of Figure 6.7 display seven independently simulated replications $y^{\mathrm{rep}}$ 
from a fitted logistic regression model (with the rows, columns, and persons for each dataset 
arranged in increasing order before display, so that we are displaying $T(y^{\mathrm{rep}})$ in each case). 
Here, the replicated datasets look fuzzy and ‘random’ compared to the observed data, which 
have strong rectilinear structures that are clearly not captured in the model. If the data 
were actually generated from the model, the observed data on the left would fit right in 
with the simulated datasets on the right. 

These data have enough internal replication that the model misfit would be clear in 
comparison to a single simulated dataset from the model. But, to be safe, it is good to 
compare to several replications to see if the patterns in the observed data could be expected 
to occur by chance under the model. 

Displaying data is not simply a matter of dumping a set of numbers on a page (or 
a screen). For example, we took care to align the graphs in Figure 6.7 to display the 
three-dimensional dataset and seven replications at once without confusion. Even more 
important, the arrangement of the rows, columns, and persons in increasing order is crucial 
to seeing the patterns in the data over and above the model. To see this, consider Figure 
6.8, which presents the same information as in Figure 6.7 but without the ordering. Here, 
the discrepancies between data and model are not clear at all. 

Figure 6.8 Redisplay of Figure 6.7 without ordering the rows, columns, and persons in order of 
increasing response. Once again, the left column shows the observed data and the right columns show 
replicated datasets from the model. Without the ordering, it is difficult to notice the discrepancies 
between data and model, which are easily apparent in Figure 6.7. 
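
The ordering step can be sketched in a few lines of Python. This is a minimal illustration for one two-way slice of the array, with a made-up binary matrix; the function name is ours, and the full display would also order the persons. The essential point is that the same function is applied separately to the observed data and to each replicated dataset, so that the ordering is part of the test quantity.

```python
def order_by_average(matrix):
    """Reorder rows and columns of a 2-D list in increasing order of their means."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[r][c] for r in range(n_rows)) / n_rows
                 for c in range(n_cols)]
    row_order = sorted(range(n_rows), key=lambda r: row_means[r])
    col_order = sorted(range(n_cols), key=lambda c: col_means[c])
    return [[matrix[r][c] for c in col_order] for r in row_order]

# Small made-up binary array (rows = reactions, columns = situations).
y = [
    [1, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 1, 1],
]
for row in order_by_average(y):
    print("".join("#" if v else "." for v in row))
```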


Displaying summary statistics or inferences 


A key principle of exploratory data analysis is to exploit regular structure to display data 
more effectively. The analogy in modeling is hierarchical or multilevel modeling, in which 
batches of parameters capture variation at different levels. When checking model fit, hier- 
archical structure can allow us to compare batches of parameters to their reference distri- 
bution. In this scenario, the replications correspond to new draws of a batch of parameters. 

We illustrate with inference from a hierarchical model from psychology. This was a 
fairly elaborate model, whose details we do not describe here; all we need to know for this 
example is that the model included two vectors of parameters, $\phi_1, \ldots, \phi_{90}$ and $\psi_1, \ldots, \psi_{69}$, 
corresponding to patients and psychological symptoms, and that each of these 159 parameters 
was assigned an independent Beta(2,2) prior distribution. Each of these parameters 
represented a probability that a given patient or symptom is associated with a particular 
psychological syndrome. 

Data were collected (measurements of which symptoms appeared in which patients) and 
the full Bayesian model was fitted, yielding posterior simulations for all these parameters. 
If the model were true, we would expect any single simulation draw of the vectors of patient 
parameters $\phi$ and symptom parameters $\psi$ to look like independent draws from the Beta(2, 2) 
distribution. We know this because of the following reasoning: 


• If the model were indeed true, we could think of the observed data vector y and the 
vector $\theta$ of the true values of all the parameters (including $\phi$ and $\psi$) as a random draw 
from their joint distribution, $p(y, \theta)$. Thus, y comes from the marginal distribution, the 
prior predictive distribution, $p(y)$. 


• A single draw $\theta^s$ from the posterior inference comes from $p(\theta^s \mid y)$. Since $y \sim p(y)$, this 
means that $y, \theta^s$ come from the model’s joint distribution of $y, \theta$, and so the marginal 
distribution of $\theta^s$ is the same as that of $\theta$. 


• That is, $y, \theta, \theta^s$ have a combined joint distribution in which $\theta$ and $\theta^s$ have the same 
marginal distributions (and the same joint distributions with y). 


Figure 6.9 Histograms of (a) 90 patient parameters and (b) 69 symptom parameters, from a single 
draw from the posterior distribution of a psychometric model. These histograms of posterior estimates 
contradict the assumed Beta(2,2) prior densities (overlain on the histograms) for each batch 
of parameters, and motivated us to switch to mixture prior distributions. This implicit comparison 
to the values under the prior distribution can be viewed as a posterior predictive check in which a 
new set of patients and a new set of symptoms are simulated. 


Thus, as a model check we can plot a histogram of a single simulation of the vector of 
parameters $\phi$ or $\psi$ and compare to the prior distribution. This corresponds to a posterior 
predictive check in which the inference from the observed data is compared to what would 
be expected if the model were applied to a new set of patients and a new set of symptoms. 

Figure 6.9 shows histograms of a single simulation draw for each of $\phi$ and $\psi$ as fitted to 
our dataset. The lines show the Beta(2, 2) prior distribution, which clearly does not fit. For 
both $\phi$ and $\psi$, there are too many cases near zero, corresponding to patients and symptoms 
that almost certainly are not associated with a particular syndrome. 
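
The comparison of a single draw to the prior can also be made numerically by setting histogram counts against the counts expected under Beta(2, 2), whose distribution function is $3x^2 - 2x^3$. The sketch below is only illustrative: the actual posterior draw is not reproduced in the text, so a simulated stand-in with many values near zero is used in its place, and the function names are ours.

```python
import random

def beta22_cdf(x):
    """CDF of Beta(2, 2): the integral of 6t(1 - t) from 0 to x."""
    return 3 * x**2 - 2 * x**3

def compare_to_prior(draw, n_bins=10):
    """Return (observed, expected) histogram counts for one draw of a batch."""
    observed = [0] * n_bins
    for v in draw:
        observed[min(int(v * n_bins), n_bins - 1)] += 1
    edges = [k / n_bins for k in range(n_bins + 1)]
    expected = [len(draw) * (beta22_cdf(b) - beta22_cdf(a))
                for a, b in zip(edges, edges[1:])]
    return observed, expected

# Stand-in for one posterior draw of the 90 patient parameters (hypothetical).
rng = random.Random(0)
phi_draw = [rng.betavariate(1, 4) for _ in range(90)]

observed, expected = compare_to_prior(phi_draw)
for k, (o, e) in enumerate(zip(observed, expected)):
    print(f"[{k / 10:.1f}, {(k + 1) / 10:.1f})  observed {o:2d}  expected {e:5.1f}")
```

A large excess of observed over expected counts in the lowest bins is the numerical counterpart of the misfit visible in the histograms.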

Our next step was to replace the offending Beta(2,2) prior distributions by mixtures of 
two beta distributions (one distribution with a spike near zero, and another that is uniform 
between 0 and 1), with different models for the $\phi$’s and the $\psi$’s. The exact model is, 


$$p(\phi_j) = 0.5\, \mathrm{Beta}(\phi_j \mid 1, 6) + 0.5\, \mathrm{Beta}(\phi_j \mid 1, 1)$$
$$p(\psi_j) = 0.5\, \mathrm{Beta}(\psi_j \mid 1, 16) + 0.5\, \mathrm{Beta}(\psi_j \mid 1, 1).$$


We set the parameters of the mixture distributions to fixed values based on our under- 
standing of the model. It was reasonable for these data to suppose that any given symptom 
appeared only about half the time; however, labeling of the symptoms is subjective, so we 
used beta distributions peaked near zero but with some probability of taking small pos- 
itive values. We assigned the Beta(1,1) (that is, uniform) distributions for the patient 
and symptom parameters that were not near zero—given the estimates in Figure 6.9, these 
seemed to fit the data better than the original Beta(2,2) models. (The original reason for 
using Beta(2,2) rather than uniform prior distributions was so that maximum likelihood 
estimates would be in the interior of the interval [0, 1], a concern that disappeared when we 
moved to Bayesian inference; see Exercise 4.9.) 
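
To see what these mixture priors imply, the following sketch evaluates the mixture density and draws from it, using the fact that the Beta(1, b) density on [0, 1] is $b(1-x)^{b-1}$. The function names are ours; the second parameters, 6 for patients and 16 for symptoms, are those of the model above.

```python
import random

def beta_1b_pdf(x, b):
    """Density of Beta(1, b) on [0, 1]: b * (1 - x)**(b - 1)."""
    return b * (1 - x) ** (b - 1)

def mixture_pdf(x, b):
    """0.5 * Beta(x | 1, b) + 0.5 * Beta(x | 1, 1); the second term is uniform."""
    return 0.5 * beta_1b_pdf(x, b) + 0.5

def mixture_draw(b, rng):
    """Pick a component with probability 0.5 each, then sample from it."""
    return rng.betavariate(1, b) if rng.random() < 0.5 else rng.random()

rng = random.Random(0)
for name, b in [("patients", 6), ("symptoms", 16)]:
    draws = [mixture_draw(b, rng) for _ in range(10000)]
    share_small = sum(d < 0.1 for d in draws) / len(draws)
    print(name, "density at 0:", mixture_pdf(0.0, b),
          "share of draws below 0.1:", round(share_small, 3))
```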

Some might object to revising the prior distribution based on the fit of the model to 
the data. It is, however, consistent with common statistical practice, in which a model is 
iteratively altered to provide a better fit to data. The natural next step would be to add 
a hierarchical structure, with hyperparameters for the mixture distributions for the patient 
and symptom parameters. This would require additional computational steps and potential 
new modeling difficulties (for example, instability in the estimated hyperparameters). Our 
main concern in this problem was to reasonably model the individual $\phi_j$ and $\psi_j$ parameters 
without the prior distributions inappropriately interfering (which appears to be happening 
in Figure 6.9). 

We refitted the model with the new prior distribution and repeated the model check, 
which is displayed in Figure 6.10. The fit of the prior distribution to the inferences is not 
perfect but is much better than before. 

Figure 6.10 Histograms of (a) 90 patient parameters and (b) 69 symptom parameters, as estimated 
from an expanded psychometric model. The mixture prior densities (overlain on the histograms) 
are not perfect, but they approximate the corresponding histograms much better than the Beta(2, 2) 
densities in Figure 6.9. 


Residual plots and binned residual plots 


Bayesian residuals. Linear and nonlinear regression models, which are the core tools of 
applied statistics, are characterized by a function $g(x, \theta) = \mathrm{E}(y \mid x, \theta)$, where x is a vector of 
predictors. Then, given the unknown parameters $\theta$ and the predictors $x_i$ for a data point 
$y_i$, the predicted value is $g(x_i, \theta)$ and the residual is $y_i - g(x_i, \theta)$. This is sometimes called 
a ‘realized’ residual in contrast to the classical or estimated residual, $y_i - g(x_i, \hat\theta)$, which is 
based on a point estimate $\hat\theta$ of the parameters. 

A Bayesian residual graph plots a single realization of the residuals (based on a single 
random draw of $\theta$). An example appears on page 484. Classical residual plots can be 
thought of as approximations to the Bayesian version, ignoring posterior uncertainty in $\theta$. 


Binned residuals for discrete data. Unfortunately, for discrete data, plots of residuals can 
be difficult to interpret because, for any particular value of $\mathrm{E}(y_i \mid x, \theta)$, the residual $r_i$ can 
only take on certain discrete values; thus, even if the model is correct, the residuals will 
not generally be expected to be independent of predicted values or covariates in the model. 
Figure 6.11 illustrates with data and then residuals plotted vs. fitted values, for a model of 
pain relief scores, which were discretely reported as 0, 1, 2, 3, or 4. The residuals have a 
distracting striped pattern because predicted values plus residuals equal discrete observed 
data values. 

Figure 6.11 (a) Residuals (observed − expected) vs. expected values for a model of pain relief scores 
(0 = no pain relief, ..., 5 = complete pain relief). (b) Average residuals vs. expected pain scores, 
with measurements divided into 20 equally sized bins defined by ranges of expected pain scores. The 
average prediction errors are relatively small (note the scale of the y-axis), but with a consistent 
pattern that low predictions are too low and high predictions are too high. Dotted lines show 95% 
bounds under the model. 

A standard way to make discrete residual plots more interpretable is to work with binned 
or smoothed residuals, which should be closer to symmetric about zero if enough residuals 
are included in each bin or smoothing category (since the expectation of each residual is by 
definition zero, the central limit theorem ensures that the distribution of averages of many 
residuals will be approximately symmetric). In particular, suppose we would like to plot 
the vector of residuals r vs. some vector $w = (w_1, \ldots, w_n)$ that can in general be a function 
of x, $\theta$, and perhaps y. We can bin the predictors and residuals by ordering the n values of 
$w_i$ and sorting them into bins $k = 1, \ldots, K$, with approximately equal numbers of points 
$n_k$ in each bin. For each bin, we then compute $\bar w_k$ and $\bar r_k$, the average values of $w_i$ and $r_i$, 
respectively, for points i in bin k. The binned residual plot is the plot of the points $\bar r_k$ vs. 
$\bar w_k$, which actually must be represented by several plots (which perhaps can be overlain) 
representing variability due to uncertainty of $\theta$ in the posterior distribution. 
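
The binning step itself is short. The following Python sketch shows one possible implementation for a single draw of the parameters; the function name, the way bin boundaries are chosen, and the simulated example are our assumptions rather than details given in the text.

```python
import random

def binned_residuals(w, r, n_bins):
    """Average w and r within n_bins bins of nearly equal size, ordered by w."""
    n = len(w)
    if n_bins > n:
        raise ValueError("need at least one point per bin")
    order = sorted(range(n), key=lambda i: w[i])
    w_bar, r_bar = [], []
    for k in range(n_bins):
        idx = order[k * n // n_bins:(k + 1) * n // n_bins]
        w_bar.append(sum(w[i] for i in idx) / len(idx))
        r_bar.append(sum(r[i] for i in idx) / len(idx))
    return w_bar, r_bar

# Made-up example: expected scores for one parameter draw, and discrete
# outcomes on 0..4 generated so that the expectations are correct.
rng = random.Random(2)
expected = [4 * rng.random() for _ in range(400)]
y = [sum(rng.random() < e / 4 for _ in range(4)) for e in expected]
resid = [yi - ei for yi, ei in zip(y, expected)]

w_bar, r_bar = binned_residuals(expected, resid, n_bins=20)
for wk, rk in zip(w_bar, r_bar):
    print(round(wk, 2), round(rk, 3))
```

Repeating the computation for several posterior draws gives the several overlaid plots described above.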

Since we are viewing the plot as a test variable, it must be compared to the distribution 
of plots of $\bar r_k^{\mathrm{rep}}$ vs. $\bar w_k^{\mathrm{rep}}$, where, for each simulation draw, the values of $\bar r_k^{\mathrm{rep}}$ are computed by 
averaging the replicated residuals $r_i^{\mathrm{rep}} = y_i^{\mathrm{rep}} - \mathrm{E}(y_i \mid x, \theta)$ for points i in bin k. In general, 
the values of $w_i$ can depend on y, and so the bins and the values of $\bar w_k^{\mathrm{rep}}$ can vary among 
the replicated datasets. 

Because we can compare to the distribution of simulated replications, the question arises: 
why do the binning at all? We do so because we want to understand the model misfits 
that we detect. Because of the discreteness of the data, the individual residuals $r_i$ have 
asymmetric discrete distributions. As expected, the binned residuals are approximately 
symmetrically distributed. In general it is desirable for the posterior predictive reference 
distribution of a discrepancy variable to exhibit some simple features (in this case, independence 
and approximate normality of the $\bar r_k$’s) so that there is a clear interpretation of 
a misfit. This is, in fact, the same reason that one plots residuals, rather than data, vs. 
predicted values: it is easier to compare to an expected horizontal line than to an expected 
45° line. 

Under the model, the residuals are independent and, if enough are in each bin, the mean 
residuals $\bar r_k$ are approximately normally distributed. We can then display the reference 
distribution as 95% error bounds, as in Figure 6.11b. We never actually have to display 
the replicated data; the replication distribution is implicit, given our knowledge that the 
binned residuals are independent, approximately normally distributed, and with expected 
variation as shown by the error bounds. 
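
One way to obtain such bounds follows directly from the independence of the residuals within a bin. The expressions below are a sketch under the assumption that the residual variances $\mathrm{var}(r_i \mid \theta)$ are available from the model; the text does not state exactly how the bounds in Figure 6.11b were computed.

```latex
\mathrm{E}(\bar r_k \mid \theta) = 0,
\qquad
\mathrm{var}(\bar r_k \mid \theta)
  = \frac{1}{n_k^2} \sum_{i \,\in\, \text{bin } k} \mathrm{var}(r_i \mid \theta),
\qquad
\text{approximate 95\% bounds: } \pm 1.96 \sqrt{\mathrm{var}(\bar r_k \mid \theta)} .
```

With bins of roughly equal size and similar residual variances, the bounds shrink in proportion to $1/\sqrt{n_k}$.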


General interpretation of graphs as model checks 


More generally, we can compare any data display to replications under the model—not 
necessarily as an explicit model check but more to understand what the display ‘should’ 
look like if the model were true. For example, the maps and scatterplots of high and 
low cancer rates (Figures 2.6-2.8) show strong patterns, but these are not particularly 
informative if the same patterns would be expected of replications under the model. The 
erroneous initial interpretation of Figure 2.6—as evidence of a pattern of high cancer rates 
in the sparsely populated areas in the center-west of the country—can be thought of as an 
erroneous model check, in which the data display was compared to a random pattern rather 
than to the pattern expected under a reasonable model of variation in cancer occurrences. 


6.5 Model checking for the educational testing example 


We illustrate the ideas of this chapter with the example from Section 5.5. 


Assumptions of the model 


The inference presented for the 8 schools example is based on several model assumptions: (1) 
normality of the estimates $y_j$ given $\theta_j$ and $\sigma_j$, where the values $\sigma_j$ are assumed known; (2) 
exchangeability of the prior distribution of the $\theta_j$’s; (3) normality of the prior distribution 
of each $\theta_j$ given $\mu$ and $\tau$; and (4) uniformity of the hyperprior distribution of $(\mu, \tau)$. 

The assumption of normality with a known variance is made routinely when a study is 
summarized by its estimated effect and standard error. The design (randomization, reason- 
ably large sample sizes, adjustment for scores on earlier tests) and analysis (for example, 
the raw data of individual test scores were checked for outliers in an earlier analysis) were 
such that the assumptions seem justifiable in this case. 

The second modeling assumption deserves commentary. The real-world interpretation of 
the mathematical assumption of exchangeability of the $\theta_j$’s is that before seeing the results 
of the experiments, there is no desire to include in the model features such as a belief that 
(a) the effect in school A is probably larger than in school B or (b) the effects in schools 
A and B are more similar than in schools A and C. In other words, the exchangeability 
assumption means that we will let the data tell us about the relative ordering and similarity 
of effects in the schools. Such a prior stance seems reasonable when the results of eight 
parallel experiments are being scientifically summarized for general presentation. Generally 
accepted information concerning the effectiveness of the programs or differences among the 
schools might suggest a nonexchangeable prior distribution if, for example, schools B and 
C have similar students and schools A, D, E, F, G, H have similar students. Unusual 
types of detailed prior knowledge (for example, two schools are similar but we do not 
know which schools they are) can suggest an exchangeable prior distribution that is not 
a mixture of independent and identically distributed components. In the absence of any 
such information, the exchangeability assumption implies that the prior distribution of the 
$\theta_j$’s can be considered as independent samples from a population whose distribution is 
indexed by some hyperparameters (in our model, $(\mu, \tau)$) that have their own hyperprior 
distribution. 

The third and fourth modeling assumptions are harder to justify a priori than the 
first two. Why should the school effects be normally distributed rather than, say, Cauchy 
distributed, or even asymmetrically distributed, and why should the location and scale 
parameters of this prior distribution be uniformly distributed? Mathematical tractability is 
one reason for the choice of models, but if the family of probability models is inappropriate, 
Bayesian answers can be misleading. 


Comparing posterior inferences to substantive knowledge 


Inference about the parameters in the model. When checking the model assumptions, our 
first step is to compare the posterior distribution of effects to our knowledge of educational 
testing. The estimated treatment effects (the posterior means) for the eight schools range 
from 5 to 10 points, which are plausible values. (The scores on the test can range from 
200 to 800.) The effect in school A could be as high as 31 points or as low as $-2$ points (a 
95% posterior interval). Either of these extremes seems plausible. We could look at other 
summaries as well, but it seems clear that the posterior estimates of the parameters do 
not violate our common sense or our limited substantive knowledge about test preparation 
courses. 


Inference about predicted values. Next, we simulate the posterior predictive distribution 
of a hypothetical replication of the experiments. Sampling from the posterior predictive 
distribution is nearly effortless given all that we have done so far: from each of the 200 
simulations from the posterior distribution of $(\theta, \mu, \tau)$, we simulate a hypothetical replicated 
dataset, $y^{\mathrm{rep}} = (y_1^{\mathrm{rep}}, \ldots, y_8^{\mathrm{rep}})$, by drawing each $y_j^{\mathrm{rep}}$ from a normal distribution with mean 
$\theta_j$ and standard deviation $\sigma_j$. The resulting set of 200 vectors $y^{\mathrm{rep}}$ summarizes the posterior 
predictive distribution. (Recall from Section 5.5 that we are treating y—the eight separate 
estimates—as the ‘raw data’ from the eight experiments.) 

The model-generated parameter values $\theta_j$ for each school in each of the 200 replications 
are all plausible outcomes of experiments on coaching. The simulated hypothetical observations 
$y_j^{\mathrm{rep}}$ range from $-48$ to 63; again, we find these possibilities to be plausible given our 
general understanding of this area. 
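
The simulation just described can be sketched end to end in Python. The code below uses the estimates and standard errors from Table 5.2 and the conditional distributions for the hierarchical normal model from Section 5.4; the grid for $\tau$ (its range and spacing), the random seed, and all function names are our assumptions, so the simulated values will not match the book's draws exactly.

```python
import math
import random

# Estimates and standard errors for the eight schools (Table 5.2).
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]

def mu_hat_and_v(tau):
    """Conditional posterior mean and variance of mu given tau."""
    prec = [1 / (s**2 + tau**2) for s in sigma]
    v_mu = 1 / sum(prec)
    mu_hat = v_mu * sum(p * yj for p, yj in zip(prec, y))
    return mu_hat, v_mu

def log_post_tau(tau):
    """Unnormalized log p(tau | y) under a uniform prior on tau."""
    mu_hat, v_mu = mu_hat_and_v(tau)
    out = 0.5 * math.log(v_mu)
    for yj, s in zip(y, sigma):
        var = s**2 + tau**2
        out += -0.5 * math.log(var) - (yj - mu_hat) ** 2 / (2 * var)
    return out

def draw_replication(tau_grid, weights, rng):
    """Draw (tau, mu, theta) from the posterior, then y_rep given theta."""
    tau = rng.choices(tau_grid, weights=weights)[0]
    mu_hat, v_mu = mu_hat_and_v(tau)
    mu = rng.gauss(mu_hat, math.sqrt(v_mu))
    theta, y_rep = [], []
    for yj, s in zip(y, sigma):
        v_j = 1 / (1 / s**2 + 1 / tau**2)
        theta_j = rng.gauss(v_j * (yj / s**2 + mu / tau**2), math.sqrt(v_j))
        theta.append(theta_j)
        y_rep.append(rng.gauss(theta_j, s))  # replicated estimate for school j
    return theta, y_rep

rng = random.Random(0)
tau_grid = [0.02 * (k + 0.5) for k in range(2000)]  # grid on (0, 40), assumed
log_w = [log_post_tau(t) for t in tau_grid]
weights = [math.exp(lw - max(log_w)) for lw in log_w]

sims = [draw_replication(tau_grid, weights, rng) for _ in range(200)]
y_reps = [y_rep for _, y_rep in sims]
print(min(min(r) for r in y_reps), max(max(r) for r in y_reps))
```

Sampling $\tau$ from a grid is an approximation to its continuous posterior distribution; a finer or wider grid can be substituted by changing `tau_grid`.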


Posterior predictive checking 


If the fit to data shows serious problems, we may have cause to doubt the inferences ob- 
tained under the model such as displayed in Figure 5.8 and Table 5.3. For instance, how 
consistent is the largest observed outcome, 28 points, with the posterior predictive distri- 
bution under the model? Suppose we perform 200 posterior predictive simulations of the 
coaching experiments and compute the largest observed outcome, $\max_j y_j^{\mathrm{rep}}$, for each. If 
all 200 of these simulations lie below 28 points, then the model does not fit this important 
aspect of the data, and we might suspect that the normal-based inference in Section 5.5 
shrinks the effect in School A too far. 

To test the fit of the model to data, we examine the posterior predictive distribution 
of the following four test statistics: the largest of the 8 observed outcomes, $\max_j y_j$, the 
smallest, $\min_j y_j$, the average, $\mathrm{mean}(y_j)$, and the sample standard deviation, $\mathrm{sd}(y_j)$. We 
approximate the posterior predictive distribution of each test statistic by the histogram of 
the values from the 200 simulations of the parameters and predictive data, and we compare 
each distribution to the observed value of the test statistic and our substantive knowledge 
of SAT coaching programs. The results are displayed in Figure 6.12. 

The summaries suggest that the model generates predicted results similar to the observed 
data in the study; that is, the actual observations are typical of the predicted observations 
generated by the model. 
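
Given a set of replicated datasets, the four checks reduce to counting how often each replicated test statistic is at least as large as the observed one. The sketch below shows that computation. So that it runs on its own, it uses placeholder replications with every school effect fixed at 8 points; this is not the posterior distribution, and in a real check each replication would be generated from a posterior draw of the $\theta_j$’s. The function and variable names are ours.

```python
import random
from statistics import mean, stdev

TEST_STATISTICS = {"max": max, "min": min, "mean": mean, "sd": stdev}

def posterior_predictive_p_values(y, y_reps):
    """For each test statistic T, the share of replications with T(y_rep) >= T(y)."""
    return {
        name: sum(T(y_rep) >= T(y) for y_rep in y_reps) / len(y_reps)
        for name, T in TEST_STATISTICS.items()
    }

# Estimates and standard errors for the eight schools (Table 5.2).
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]

# Placeholder replications (hypothetical): effects fixed at 8 for every school.
rng = random.Random(0)
y_reps = [[rng.gauss(8, s) for s in sigma] for _ in range(200)]

print(posterior_predictive_p_values(y, y_reps))
```

Values near 0 or 1 for any of the four statistics would indicate an aspect of the data that the model fails to reproduce.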

Many other functions of the posterior predictive distribution could be examined, such 
as the differences between individual values of $y_j^{\mathrm{rep}}$. Or, if we had a particular skewed prior 
distribution in mind for the effects $\theta_j$, we could construct a test quantity based on the skewness 
or asymmetry of the simulated predictive data as a check on the normal model. Often 
in practice we can obtain diagnostically useful displays directly from intuitively interesting 
quantities without having to supply a specific alternative model. 


Figure 6.12 Posterior predictive distribution, observed result, and p-value for each of four test 
statistics for the educational testing example. 


Sensitivity analysis 


The model checks seem to support the posterior inferences for the educational testing example. 
Although we may feel confident that the data do not contradict the model, this is 
not enough to inspire complete confidence in our general substantive conclusions, because 
other reasonable models might provide just as good a fit but lead to different conclusions. 
Sensitivity analysis can then be used to assess the effect of alternative analyses on the 
posterior inferences. 


The uniform prior distribution for $\tau$. To assess the sensitivity to the prior distribution 
for $\tau$ we consider Figure 5.5, the graph of the marginal posterior density, $p(\tau \mid y)$, obtained 
under the assumption of a uniform prior density for $\tau$ on the positive half of the real line. 
One can obtain the posterior density for $\tau$ given other choices of the prior distribution by 
multiplying the density displayed in Figure 5.5 by the prior density and renormalizing. There will be little 
change in the posterior inferences as long as the prior density is not sharply peaked and 
does not put a great deal of probability mass on values of $\tau$ greater than 10. 
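
This reweighting is easy to carry out on a grid. The sketch below computes the marginal posterior density of $\tau$ under the uniform prior using the formula from Section 5.4 and the data of Table 5.2, then multiplies by an alternative prior density and renormalizes. The half-Cauchy prior with scale 25, the grid, and the function names are chosen by us purely for illustration.

```python
import math

# Estimates and standard errors for the eight schools (Table 5.2).
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]

def post_tau_uniform(tau):
    """Unnormalized p(tau | y) under the uniform prior on tau."""
    prec = [1 / (s**2 + tau**2) for s in sigma]
    v_mu = 1 / sum(prec)
    mu_hat = v_mu * sum(p * yj for p, yj in zip(prec, y))
    log_p = 0.5 * math.log(v_mu)
    for p, yj in zip(prec, y):
        log_p += 0.5 * math.log(p) - 0.5 * p * (yj - mu_hat) ** 2
    return math.exp(log_p)

def normalize(values, step):
    """Scale grid values so that they integrate to 1 (midpoint rule)."""
    total = sum(values) * step
    return [v / total for v in values]

def half_cauchy(tau, scale=25.0):
    """Alternative prior density on tau > 0, chosen only for illustration."""
    return 2 / (math.pi * scale * (1 + (tau / scale) ** 2))

step = 0.05
grid = [step * (k + 0.5) for k in range(800)]  # tau in (0, 40), assumed range
base = normalize([post_tau_uniform(t) for t in grid], step)
alt = normalize([b * half_cauchy(t) for b, t in zip(base, grid)], step)

def grid_mean(density):
    """Posterior mean of tau computed on the grid."""
    return sum(t * d for t, d in zip(grid, density)) * step

print(round(grid_mean(base), 2), round(grid_mean(alt), 2))
```

Comparing summaries such as the two posterior means shows how much, or how little, the inference for $\tau$ moves under the alternative prior.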


The normal population distribution for the school effects. The normal distribution assumption 
on the $\theta_j$’s is made for computational convenience, as is often the case. A natural 
sensitivity analysis is to consider longer-tailed alternatives, such as the t, as a check on 
robustness. We defer the details of this analysis to Section 17.4, after the required com- 
putational techniques have been presented. Any alternative model must be examined to 
ensure that the predictive distributions are restricted to realistic SAT improvements. 


The normal likelihood. As discussed earlier, the assumption of normal data conditional 
on the means and standard deviations need not and cannot be seriously challenged in this 
example. The justification is based on the central limit theorem and the designs of the 
studies. Assessing the validity of this assumption would require access to the original data 
from the eight experiments, not just the estimates and standard errors given in Table 5.2. 


6.6 Bibliographic note 


The posterior predictive approach to model checking described here was presented in Rubin 
(1981a, 1984). Gelman, Meng, and Stern (1996) discuss the use of test quantities that 
depend on parameters as well as data; related ideas appear in Zellner (1976) and Tsui 
and Weerahandi (1989). Rubin and Stern (1994) and Raghunathan (1994) provide further 
applied examples. The examples in Section 6.4 appear in Meulders et al. (1998) and Gelman 
(2003). The antisymmetric discrepancy measures discussed in Section 6.3 appear in Berkhof, 
Van Mechelen, and Gelman (2003). The adolescent smoking example appears in Carlin et 
al. (2001). Sinharay and Stern (2003) discuss posterior predictive checks for hierarchical 
models, focusing on the SAT coaching example. Johnson (2004) discusses Bayesian $\chi^2$ tests 
as well as the idea of using predictive checks as a debugging tool, as discussed in Section 
10.7. Bayarri and Castellanos (2007) present a slightly different perspective on posterior 
predictive checking, to which we respond in Gelman (2007b). Gelman (2013b) discusses the 
statistical properties of posterior predictive p-values in two simple examples. 

Model checking using simulation has a long history in statistics; for example, Bush and 
Mosteller (1955, p. 252) check the fit of a model by comparing observed data to a set of 
simulated data. Their method differs from posterior predictive checking only in that their 
model parameters were fixed at point estimates for the simulations rather than being drawn 
from a posterior distribution. Ripley (1988) applies this idea repeatedly to examine the fits 
of models for spatial data. Early theoretical papers featuring ideas related to Bayesian 
posterior predictive checks include Guttman (1967) and Dempster (1971). Bernardo and 
Smith (1994) discuss methods of comparing models based on predictive errors. Gelman, 
Van Mechelen, et al. (2005) consider Bayesian model checking in the presence of missing 
data. O’Hagan (2003) discusses tools for measuring conflict between information from prior 
and likelihood at any level of hierarchical model. Shirley and Gelman (2014) demonstrate 
a number of graphical displays for use in understanding a fitted hierarchical model. 

Gelfand, Dey, and Chang (1992) and Gelfand (1996) discuss cross-validation predictive 
checks. Gelman, Meng, and Stern (1996) use the term ‘mixed predictive check’ if direct pa- 
rameters are replicated from their prior given the posterior samples for the hyperparameters 
(predictions for new groups). Marshall and Spiegelhalter (2007) discuss different posterior, 
mixed, and cross-validation predictive checks for outlier detection. Gneiting, Balabdaoui, 
and Raftery (2007) discuss test quantities for marginal predictive calibration. 

Box (1980, 1983) has contributed a wide-ranging discussion of model checking (‘model 
criticism’ in his terminology), including a consideration of why it is needed in addition to 
model expansion and averaging. Box proposed checking models by comparing data to the 
prior predictive distribution; in the notation of our Section 6.3, defining replications with 
distribution $p(y^{\rm rep}) = \int p(y^{\rm rep}|\theta)p(\theta)d\theta$. This approach has different implications for model 
checking; for example, with an improper prior distribution on $\theta$, the prior predictive distribution 
is itself improper and thus the check is not generally defined, even if the posterior 
distribution is proper (see Exercise 6.7). 

Box was also an early contributor to the literature on sensitivity analysis and robustness 
in standard models based on normal distributions: see Box and Tiao (1962, 1973). Various 
theoretical studies have been performed on Bayesian robustness and sensitivity analysis 
examining the question of how posterior inferences are affected by prior assumptions; see 
Leamer (1978b), McCulloch (1989), Wasserman (1992), and the references at the end of 
Chapter 17. Kass and coworkers have developed methods based on Laplace’s approximation 
for approximate sensitivity analysis: for example, see Kass, Tierney, and Kadane (1989) 
and Kass and Vaidyanathan (1992). 

Finally, many model checking methods in common practical use, including tests for 
outliers, plots of residuals, and normal plots, can be interpreted as Bayesian posterior 
predictive checks, where the practitioner is looking for discrepancies from the expected 
results under the assumed model (see Gelman, 2003, for an extended discussion of this 
point). Many non-Bayesian treatments of graphical model checking appear in the statistical 
literature, for example, Atkinson (1985). Tukey (1977) presents a graphical approach to 
data analysis that is, in our opinion, fundamentally based on model checking (see Gelman, 
2003 and Gelman, 2004a). The books by Cleveland (1985, 1993) and Tufte (1983, 1990) 
present many useful ideas for displaying data graphically; these ideas are fundamental to 
the graphical model checks described in Section 6.4. 
6.7 Exercises 


1. Posterior predictive checking: 


(a) On page 120, the data from the SAT coaching experiments were checked against the 
model that assumed identical effects in all eight schools: the expected order statistics 
of the effect sizes were (26, 19, 14, 10, 6, 2, -3, -9), compared to observed data of (28, 
18, 12, 8, 7, 1, -1, -3). Express this comparison formally as a posterior predictive 
check comparing this model to the data. Does the model fit the aspect of the data 
tested here? 


(b) Explain why, even though the identical-schools model fits under this test, it is still 
unacceptable for some practical purposes. 


2. Model checking: in Exercise 2.13, the counts of airline fatalities in 1976-1985 were fitted 
to four different Poisson models. 


(a) For each of the models, set up posterior predictive test quantities to check the following 
assumptions: (1) independent Poisson distributions, (2) no trend over time. 


(b) For each of the models, use simulations from the posterior predictive distributions to 
measure the discrepancies. Display the discrepancies graphically and give p-values. 


(c) Do the results of the posterior predictive checks agree with your answers in Exercise 
2.13(e)? 


3. Model improvement: 


(a) Use the solution to the previous problem and your substantive knowledge to construct 
an improved model for airline fatalities. 


(b) Fit the new model to the airline fatality data. 


(c) Use your new model to forecast the airline fatalities in 1986. How does this differ from 
the forecasts from the previous models? 


(d) Check the new model using the same posterior predictive checks as you used in the 
previous models. Does the new model fit better? 


4. Model checking and sensitivity analysis: find a published Bayesian data analysis from 
the statistical literature. 


(a) Compare the data to posterior predictive replications of the data. 


(b) Perform a sensitivity analysis by computing posterior inferences under plausible alter- 
native models. 


5. Hypothesis testing: discuss the statement, ‘Null hypotheses of no difference are usually 
known to be false before the data are collected; when they are, their rejection or ac- 
ceptance simply reflects the size of the sample and the power of the test, and is not a 
contribution to science’ (Savage, 1957, quoted in Kish, 1965). If you agree with this 
statement, what does this say about the model checking discussed in this chapter? 


6. Variety of predictive reference sets: in the example of binary outcomes on page 147, it is 
assumed that the number of measurements, n, is fixed in advance, and so the hypothetical 
replications under the binomial model are performed with n = 20. Suppose instead that 
the protocol for measurement is to stop once 13 zeros have appeared. 


(a) Explain why the posterior distribution of the parameter $\theta$ under the assumed model 
does not change. 


(b) Perform a posterior predictive check, using the same test quantity, T = number of 
switches, but simulating the replications $y^{\rm rep}$ under the new measurement protocol. 
Display the predictive simulations, $T(y^{\rm rep})$, and discuss how they differ from Figure 
6.5. 
7. Prior vs. posterior predictive checks (from Gelman, Meng, and Stern, 1996): consider 
100 observations, $y_1,\ldots,y_n$, modeled as independent samples from a $\mathrm{N}(\theta, 1)$ distribution 
with a diffuse prior distribution, say, $p(\theta) = \frac{1}{2A}$ for $\theta \in [-A, A]$ with some extremely 
large value of $A$, such as $10^5$. We wish to check the model using, as a test statistic, 
$T(y) = \max_i |y_i|$: is the maximum absolute observed value consistent with the normal 
model? Consider a dataset in which $\bar{y} = 5.1$ and $T(y) = 8.1$. 


(a) What is the posterior predictive distribution for $y^{\rm rep}$? Make a histogram for the 
posterior predictive distribution of $T(y^{\rm rep})$ and give the posterior predictive p-value 
for the observation $T(y) = 8.1$. 

(b) The prior predictive distribution is $p(y^{\rm rep}) = \int p(y^{\rm rep}|\theta)p(\theta)d\theta$. (Compare to equation 
(6.1).) What is the prior predictive distribution for $y^{\rm rep}$ in this example? Roughly 
sketch the prior predictive distribution of $T(y^{\rm rep})$ and give the approximate prior 
predictive p-value for the observation $T(y) = 8.1$. 

(c) Your answers for (a) and (b) should show that the data are consistent with the pos- 
terior predictive but not the prior predictive distribution. Does this make sense? 
Explain. 

8. Variety of posterior predictive distributions: for the educational testing example in Sec- 
tion 6.5, we considered a reference set for the posterior predictive simulations in which 
$\theta = (\theta_1,\ldots,\theta_8)$ was fixed. This corresponds to a replication of the study with the same 
eight coaching programs. 

(a) Consider an alternative reference set, in which $(\mu, \tau)$ are fixed but $\theta$ is allowed to vary. 
Define a posterior predictive distribution for $y^{\rm rep}$ under this replication, by analogy 
to (6.1). What is the experimental replication that corresponds to this reference set? 

(b) Consider switching from the analysis of Section 6.5 to an analysis using this alternative 
reference set. Would you expect the posterior predictive p-values to be less extreme, 
more extreme, or stay about the same? Why? 

(c) Reproduce the model checks of Section 6.5 based on this posterior predictive distri- 
bution. Compare to your speculations in part (b). 


9. Model checking: check the assumed model fitted to the rat tumor data in Section 5.3. 
Define some test quantities that might be of scientific interest, and compare them to 
their posterior predictive distributions. 

10. Checking the assumption of equal variance: Figures 1.1 and 1.2 on pages 14 and 15 
display data on point spreads x and score differentials y of a set of professional football 
games. (The data are available at http://www.stat.columbia.edu/~gelman/book/.) 
In Section 1.6, a model is fit of the form, $y \sim \mathrm{N}(x, 14^2)$. However, Figure 1.2a seems to 
show a pattern of decreasing variance of $y - x$ as a function of $x$. 


(a) Simulate several replicated datasets $y^{\rm rep}$ under the model and, for each, create graphs 
like Figures 1.1 and 1.2. Display several graphs per page, and compare these to the 
corresponding graphs of the actual data. This is a graphical posterior predictive check 
as described in Section 6.4. 

(b) Create a numerical summary $T(x, y)$ to capture the apparent decrease in variance of 
$y - x$ as a function of $x$. Compare this to the distribution of simulated test statistics, 
$T(x, y^{\rm rep})$, and compute the p-value for this posterior predictive check. 
Chapter 7 


Evaluating, comparing, and expanding models 


In the previous chapter we discussed discrepancy measures for checking the fit of model 
to data. In this chapter we seek not to check models but to compare them and explore 
directions for improvement. Even if all of the models being considered have mismatches 
with the data, it can be informative to evaluate their predictive accuracy and consider 
where to go next. The challenge we focus on here is the estimation of the predictive model 
accuracy, correcting for the bias inherent in evaluating a model’s predictions of the data 
that were used to fit it. 


We proceed as follows. First we discuss measures of predictive fit, using a small linear 
regression as a running example. We consider the differences between external validation, 
fit to training data, and cross-validation in Bayesian contexts. Next we describe information 
criteria, which are estimates and approximations to cross-validated or externally validated 
fit, used for adjusting for overfitting when measuring predictive error. Section 7.3 considers 
the use of predictive error measures for model comparison using the 8-schools model as an 
example. The chapter continues with Bayes factors and continuous model expansion and 
concludes in Section 7.6 with an extended discussion of robustness to model assumptions 
in the context of a simple but nontrivial sampling example. 


Example. Forecasting presidential elections 

We shall use a simple linear regression as a running example. Figure 7.1 shows a 
quick summary of economic conditions and presidential elections over the past several 
decades. It is based on the ‘bread and peace’ model created by political scientist 
Douglas Hibbs to forecast elections based solely on economic growth (with corrections 
for wartime, notably Adlai Stevenson’s exceptionally poor performance in 1952 and 
Hubert Humphrey’s loss in 1968, years when Democrats were presiding over unpopular 
wars). Better forecasts are possible using additional information such as incumbency 
and opinion polls, but what is impressive here is that this simple model does pretty 
well all by itself. 

For simplicity, we predict y (vote share) solely from x (economic performance), using 
a linear regression, $y \sim \mathrm{N}(a + bx, \sigma^2)$, with a noninformative prior distribution, 
$p(a, b, \log\sigma) \propto 1$. Although these data form a time series, we are treating them here 
as a simple regression problem. 

The posterior distribution for linear regression with a conjugate prior is normal-inverse-$\chi^2$. 
We go over the derivation in Chapter 14; here we quickly present the 
distribution for our two-coefficient example so that we can go forward and use this 
distribution in our predictive error measures. The posterior distribution is most conveniently 
factored as $p(a, b, \sigma^2|y) = p(\sigma^2|y)\,p(a, b|\sigma^2, y)$: 


• The marginal posterior distribution of the variance parameter is 


$$\sigma^2|y \sim \mbox{Inv-}\chi^2(n - J, s^2),$$
[Figure 7.1 (graph): ‘Forecasting elections from the economy.’ Each row is a presidential election, listed in order of income growth from highest to lowest: Johnson vs. Goldwater (1964), Reagan vs. Mondale (1984), Nixon vs. McGovern (1972), Humphrey vs. Nixon (1968), Eisenhower vs. Stevenson (1956), Stevenson vs. Eisenhower (1952), Gore vs. Bush, Jr. (2000), Bush, Sr. vs. Dukakis (1988), Bush, Jr. vs. Kerry (2004), Ford vs. Carter (1976), Clinton vs. Dole (1996), Nixon vs. Kennedy (1960), Bush, Sr. vs. Clinton (1992), McCain vs. Obama (2008), Carter vs. Reagan (1980). Rows are grouped by income growth (more than 4%, 3% to 4%, 2% to 3%, 1% to 2%, 0% to 1%, negative); the horizontal axis shows the incumbent party's share of the popular vote, from 45% to 60%. Matchups are all listed as incumbent party's candidate vs. other party's candidate. Income growth is a weighted measure over the four years preceding the election. Vote share excludes third parties.]

Figure 7.1 Douglas Hibbs’s ‘bread and peace’ model of voting and the economy. Presidential elec- 
tions since 1952 are listed in order of the economic performance at the end of the preceding ad- 
ministration (as measured by inflation-adjusted growth in average personal income). The better 
the economy, the better the incumbent party’s candidate generally does, with the biggest exceptions 
being 1952 (Korean War) and 1968 (Vietnam War). 


where 


$$s^2 = \frac{1}{n - J}(y - X\hat\beta)^T(y - X\hat\beta),$$


and X is the n × J matrix of predictors, in this case the 15 × 2 matrix whose first 
column is a column of 1’s and whose second column is the vector x of economic 
performance numbers. 


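• The conditional posterior distribution of the vector of coefficients, $\beta = (a, b)$, is 


$$\beta|\sigma^2, y \sim \mathrm{N}(\hat\beta, V_\beta\sigma^2),$$


where 

$$\hat\beta = (X^TX)^{-1}X^Ty, \qquad V_\beta = (X^TX)^{-1}.$$

For the data at hand, $s = 3.6$, $\hat\beta = (45.9, 3.2)$, and $V_\beta = \begin{pmatrix} 0.21 & -0.07 \\ -0.07 & 0.04 \end{pmatrix}$. 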
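
The quantities in the two bullets above can be computed and simulated in a few lines. The following is only a minimal sketch using NumPy: the function names and the small dataset are hypothetical (they are not the election data of Figure 7.1), and the draws use the standard fact that an Inv-$\chi^2(\nu, s^2)$ variate can be simulated as $\nu s^2$ divided by a $\chi^2_\nu$ draw.

```python
import numpy as np

def regression_posterior_summaries(x, y):
    """Return (beta_hat, V_beta, s2, n, J) for the regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(x), x])   # n x J matrix: column of 1's, then x
    n, J = X.shape
    V_beta = np.linalg.inv(X.T @ X)             # (X^T X)^{-1}
    beta_hat = V_beta @ X.T @ y                 # (X^T X)^{-1} X^T y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - J)                # s^2
    return beta_hat, V_beta, s2, n, J

def draw_posterior(x, y, S, rng):
    """Draw S values of (beta, sigma^2) from the normal-inverse-chi^2 posterior."""
    beta_hat, V_beta, s2, n, J = regression_posterior_summaries(x, y)
    # sigma^2 | y ~ Inv-chi^2(n - J, s^2), simulated as (n - J) s^2 / chi^2_{n-J}
    sigma2 = (n - J) * s2 / rng.chisquare(n - J, size=S)
    # beta | sigma^2, y ~ N(beta_hat, V_beta sigma^2)
    L = np.linalg.cholesky(V_beta)
    z = rng.standard_normal((S, J))
    beta = beta_hat + np.sqrt(sigma2)[:, None] * (z @ L.T)
    return beta, sigma2

# Hypothetical data for illustration only, NOT the election data
rng = np.random.default_rng(0)
x_demo = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y_demo = 46.0 + 3.0 * x_demo + rng.normal(0.0, 3.0, size=x_demo.size)
beta_draws, sigma2_draws = draw_posterior(x_demo, y_demo, S=4000, rng=rng)
```

Each row of `beta_draws` pairs with the corresponding entry of `sigma2_draws`, so the draws can be fed directly into the predictive error measures discussed in this chapter.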


7.1 Measures of predictive accuracy 


One way to evaluate a model is through the accuracy of its predictions. Sometimes we 
care about this accuracy for its own sake, as when evaluating a forecast. In other settings, 
predictive accuracy is valued for comparing different models rather than for its own sake. We 
begin by considering different ways of defining the accuracy or error of a model’s predictions, 
then discuss methods for estimating predictive accuracy or error from data. 
Preferably, the measure of predictive accuracy is specifically tailored for the application 
at hand, and it measures as correctly as possible the benefit (or cost) of predicting future 
data with the model. Examples of application-specific measures are classification accuracy 
and monetary cost. More examples are given in Chapter 9 in the context of decision analysis. 
For many data analyses, however, explicit benefit or cost information is not available, and 
the predictive performance of a model is assessed by generic scoring functions and rules. 

In point prediction (predictive point estimation or point forecasting) a single value is 
reported as a prediction of the unknown future observation. Measures of predictive accuracy 
for point prediction are called scoring functions. We consider the squared error as an 
example as it is the most common scoring function in the literature on prediction. 


Mean squared error. A model’s fit to new data can be summarized in point prediction 
by mean squared error, $\frac{1}{n}\sum_{i=1}^n (y_i - \mathrm{E}(y_i|\theta))^2$, or a weighted version such as 
$\frac{1}{n}\sum_{i=1}^n (y_i - \mathrm{E}(y_i|\theta))^2/\mathrm{var}(y_i|\theta)$. These measures have the advantage of being easy to compute and, more 
importantly, to interpret, but the disadvantage of being less appropriate for models that 
are far from the normal distribution. 
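
As a small illustration of these two scoring functions, the sketch below computes them with NumPy. The function and argument names are hypothetical; `pred_mean` and `pred_var` stand for $\mathrm{E}(y_i|\theta)$ and $\mathrm{var}(y_i|\theta)$ evaluated at whatever value of $\theta$ is being assessed.

```python
import numpy as np

def mean_squared_error(y, pred_mean):
    """(1/n) * sum_i (y_i - E(y_i|theta))^2."""
    y = np.asarray(y, dtype=float)
    pred_mean = np.asarray(pred_mean, dtype=float)
    return float(np.mean((y - pred_mean) ** 2))

def weighted_mean_squared_error(y, pred_mean, pred_var):
    """(1/n) * sum_i (y_i - E(y_i|theta))^2 / var(y_i|theta)."""
    y = np.asarray(y, dtype=float)
    pred_mean = np.asarray(pred_mean, dtype=float)
    pred_var = np.asarray(pred_var, dtype=float)
    return float(np.mean((y - pred_mean) ** 2 / pred_var))
```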

In probabilistic prediction (probabilistic forecasting) the aim is to report inferences about 
y in such a way that the full uncertainty over y is taken into account. Measures of pre- 
dictive accuracy for probabilistic prediction are called scoring rules. Examples include the 
quadratic, logarithmic, and zero-one scores. Good scoring rules for prediction are proper 
and local: propriety of the scoring rule motivates the decision maker to report his or her 
beliefs honestly, and locality incorporates the possibility that bad predictions for some y 
may be judged more harshly than others. It can be shown that the logarithmic score is the 
unique (up to an affine transformation) local and proper scoring rule, and it is commonly 
used for evaluating probabilistic predictions. 


Log predictive density or log-likelihood. The logarithmic score for predictions is the log 
predictive density $\log p(y|\theta)$, which is proportional to the mean squared error if the model 
is normal with constant variance. The log predictive density is also sometimes called the log- 
likelihood. The log predictive density has an important role in statistical model comparison 
because of its connection to the Kullback-Leibler information measure. In the limit of large 
sample sizes, the model with the lowest Kullback-Leibler information—and thus, the highest 
expected log predictive density—will have the highest posterior probability. Thus, it seems 
reasonable to use expected log predictive density as a measure of overall model fit. Due 
to its generality, we use the log predictive density to measure predictive accuracy in this 
chapter. 
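
To make the connection with mean squared error explicit, here is a short worked derivation, assuming independent normal data with a common, fixed variance $\sigma^2$ and writing MSE for $\frac{1}{n}\sum_{i=1}^n (y_i - \mathrm{E}(y_i|\theta))^2$:

```latex
\begin{aligned}
\log p(y|\theta) &= \sum_{i=1}^n \log \mathrm{N}(y_i \,|\, \mathrm{E}(y_i|\theta), \sigma^2) \\
&= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathrm{E}(y_i|\theta))^2 \\
&= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{n}{2\sigma^2}\,\mathrm{MSE}.
\end{aligned}
```

Under these assumptions the first term does not depend on the predictions, so ranking predictions by log predictive density is the same as ranking them by (negative) mean squared error.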

Given that we are working with the log predictive density, the question may arise: why 
not use the log posterior? Why only use the data model and not the prior density in this 
calculation? The answer is that we are interested here in summarizing the fit of model to 
data, and for this purpose the prior is relevant in estimating the parameters but not in 
assessing a model’s accuracy. 

We are not saying that the prior cannot be used in assessing a model’s fit to data; rather 
we say that the prior density is not relevant in computing predictive accuracy. Predictive 
accuracy is not the only concern when evaluating a model, and even within the bailiwick 
of predictive accuracy, the prior is relevant in that it affects inferences about $\theta$ and thus 
affects any calculations involving $p(y|\theta)$. In a sparse-data setting, a poor choice of prior 
distribution can lead to weak inferences and poor predictions. 


Predictive accuracy for a single data point 


The ideal measure of a model’s fit would be its out-of-sample predictive performance for 
new data produced from the true data-generating process (external validation). We label f 
as the true model, y as the observed data (thus, a single realization of the dataset y from 
the distribution $f(y)$), and $\tilde{y}$ as future data or alternative datasets that could have been 
seen. The out-of-sample predictive fit for a new data point $\tilde{y}_i$ using logarithmic score is 
then, 


$$\log p_{\rm post}(\tilde{y}_i) = \log \mathrm{E}_{\rm post}(p(\tilde{y}_i|\theta)) = \log\int p(\tilde{y}_i|\theta)\,p_{\rm post}(\theta)\,d\theta.$$


In the above expression, $p_{\rm post}(\tilde{y}_i)$ is the predictive density for $\tilde{y}_i$ induced by the posterior 
distribution $p_{\rm post}(\theta)$. We have introduced the notation $p_{\rm post}$ here to represent the posterior 
distribution because our expressions will soon become more complicated and it will be 
convenient to avoid explicitly showing the conditioning of our inferences on the observed 
data y. More generally, we use $p_{\rm post}$ and $\mathrm{E}_{\rm post}$ to denote any probability or expectation 
that averages over the posterior distribution of $\theta$. 


Averaging over the distribution of future data 


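We must then take one further step. The future data $\tilde{y}_i$ are themselves unknown and thus 
we define the expected out-of-sample log predictive density, 


$$\begin{aligned}
\mathrm{elpd} &= \text{expected log predictive density for a new data point} \\
&= \mathrm{E}_f(\log p_{\rm post}(\tilde{y}_i)) = \int (\log p_{\rm post}(\tilde{y}_i))\,f(\tilde{y}_i)\,d\tilde{y}. \qquad (7.1)
\end{aligned}$$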


In any application, we would have some $p_{\rm post}$ but we do not in general know the data 
distribution f. A natural way to estimate the expected out-of-sample log predictive density 
would be to plug in an estimate for f, but this will tend to imply too good a fit, as we discuss 
in Section 7.2. For now we consider the estimation of predictive accuracy in a Bayesian 
context. 

To keep comparability with the given dataset, one can define a measure of predictive 
accuracy for the n data points taken one at a time: 


$$\begin{aligned}
\mathrm{elppd} &= \text{expected log pointwise predictive density for a new dataset} \\
&= \sum_{i=1}^n \mathrm{E}_f(\log p_{\rm post}(\tilde{y}_i)), \qquad (7.2)
\end{aligned}$$

which must be defined based on some agreed-upon division of the data y into individual data 
points $y_i$. The advantage of using a pointwise measure, rather than working with the joint 
posterior predictive distribution, $p_{\rm post}(\tilde{y})$, is in the connection of the pointwise calculation 
to cross-validation, which allows some fairly general approaches to approximation of 
out-of-sample fit using available data. 

It is sometimes useful to consider predictive accuracy given a point estimate $\hat\theta(y)$, thus, 


$$\text{expected log predictive density, given } \hat\theta: \quad \mathrm{E}_f(\log p(\tilde{y}|\hat\theta)). \qquad (7.3)$$


For models with independent data given parameters, there is no difference between joint or 
pointwise prediction given a point estimate, as $p(\tilde{y}|\hat\theta) = \prod_{i=1}^n p(\tilde{y}_i|\hat\theta)$. 


Evaluating predictive accuracy for a fitted model 


In practice the parameter $\theta$ is not known, so we cannot know the log predictive density 
$\log p(y|\theta)$. For the reasons discussed above we would like to work with the posterior distribution, 
$p_{\rm post}(\theta) = p(\theta|y)$, and summarize the predictive accuracy of the fitted model to 
data by 


$$\begin{aligned}
\mathrm{lppd} &= \text{log pointwise predictive density} \\
&= \log\prod_{i=1}^n p_{\rm post}(y_i) = \sum_{i=1}^n \log\int p(y_i|\theta)\,p_{\rm post}(\theta)\,d\theta. \qquad (7.4)
\end{aligned}$$
To compute this predictive density in practice, we can evaluate the expectation using draws 
from $p_{\rm post}(\theta)$, the usual posterior simulations, which we label $\theta^s$, $s = 1,\ldots,S$: 

$$\begin{aligned}
\text{computed lppd} &= \text{computed log pointwise predictive density} \\
&= \sum_{i=1}^n \log\left(\frac{1}{S}\sum_{s=1}^S p(y_i|\theta^s)\right). \qquad (7.5)
\end{aligned}$$


We typically assume that the number of simulation draws S is large enough to fully cap- 
ture the posterior distribution; thus we shall refer to the theoretical value (7.4) and the 
computation (7.5) interchangeably as the log pointwise predictive density or lppd of the 
data. 

As we shall discuss in Section 7.2, the lppd of observed data y is an overestimate of the 
elppd for future data (7.2). Hence the plan is to start with (7.5) and then apply some sort 
of bias correction to get a reasonable estimate of (7.2). 
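
The computation in (7.5) can be sketched as follows. This is a minimal illustration, assuming the pointwise log-likelihoods $\log p(y_i|\theta^s)$ have already been evaluated and stored in an S × n array (the name `log_lik` is hypothetical). The average over draws is taken on the log scale with the usual log-sum-exp device, because exponentiating large negative log densities directly can underflow.

```python
import numpy as np

def lppd(log_lik):
    """Computed lppd from an S x n array with entries log p(y_i | theta^s)."""
    log_lik = np.asarray(log_lik, dtype=float)
    S = log_lik.shape[0]
    m = log_lik.max(axis=0)                      # per-point maximum, for stability
    # log( (1/S) * sum_s exp(log_lik[s, i]) ) for each data point i
    log_mean = m + np.log(np.exp(log_lik - m).sum(axis=0)) - np.log(S)
    return float(log_mean.sum())
```

For a single data point with two draws giving densities 0.2 and 0.4, the function returns log(0.3), the log of the average density.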


Choices in defining the likelihood and predictive quantities 


As is well known in hierarchical modeling, the line separating prior distribution from like- 
lihood is somewhat arbitrary and is related to the question of what aspects of the data 
will be changed in hypothetical replications. In a hierarchical model with direct parameters 
$\alpha_1,\ldots,\alpha_J$ and hyperparameters $\phi$, factored as $p(\alpha, \phi|y) \propto p(\phi)\prod_{j=1}^J p(\alpha_j|\phi)p(y_j|\alpha_j)$, we 
can imagine replicating new data in existing groups (with the ‘likelihood’ being proportional 
to $p(y|\alpha_j)$) or new data in new groups (a new $\alpha_{J+1}$ is drawn, and the ‘likelihood’ is 
proportional to $p(y|\phi) = \int p(\alpha_{J+1}|\phi)p(y|\alpha_{J+1})d\alpha_{J+1}$). In either case we can easily compute 
the posterior predictive density of the observed data y: 


• When predicting $\tilde{y}|\alpha_j$ (that is, new data from existing groups), we compute $p(y|\alpha_j^s)$ for 
each posterior simulation $\alpha_j^s$ and then take the average, as in (7.5). 


• When predicting $\tilde{y}|\alpha_{J+1}$ (that is, new data from a new group), we sample $\alpha_{J+1}^s$ from 
$p(\alpha_{J+1}|\phi^s)$ to compute $p(y|\alpha_{J+1}^s)$. 

Similarly, in a mixture model, we can consider replications conditioning on the mixture 
indicators, or replications in which the mixture indicators are redrawn as well. 
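
The two hierarchical predictive densities described in the bullets above can be sketched as follows. The sketch assumes, purely for illustration, a normal hierarchical model with $y_j \sim \mathrm{N}(\alpha_j, \sigma_j^2)$, known $\sigma_j$, and $\alpha_j \sim \mathrm{N}(\mu, \tau^2)$, so that $\phi = (\mu, \tau)$; the function names and the arrays of posterior draws are hypothetical.

```python
import numpy as np

def normal_logpdf(y, mean, sd):
    """Log density of N(mean, sd^2) evaluated at y."""
    return -0.5 * np.log(2 * np.pi * sd ** 2) - 0.5 * ((y - mean) / sd) ** 2

def log_mean_exp(a):
    """log of the average of exp(a), computed stably."""
    m = np.max(a)
    return float(m + np.log(np.mean(np.exp(a - m))))

def lpd_existing_group(y, sd, alpha_j_draws):
    """Average p(y | alpha_j^s) over posterior draws alpha_j^s, on the log scale."""
    return log_mean_exp(normal_logpdf(y, alpha_j_draws, sd))

def lpd_new_group(y, sd, mu_draws, tau_draws, rng):
    """Draw alpha_{J+1}^s from p(alpha_{J+1} | phi^s), then average p(y | alpha_{J+1}^s)."""
    alpha_new = rng.normal(mu_draws, tau_draws)   # one new-group parameter per draw
    return log_mean_exp(normal_logpdf(y, alpha_new, sd))
```

The only difference between the two functions is where the group-level parameter comes from: the posterior draws for an existing group, or a fresh draw from the population distribution given each simulated $\phi^s$.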

Similar choices arise even in the simplest experiments. For example, in the model 
$y_1,\ldots,y_n \sim \mathrm{N}(\mu, \sigma^2)$, we have the option of assuming the sample size is fixed by design 
(that is, leaving n unmodeled) or treating it as a random variable and allowing a new ñ in 
a hypothetical replication. 

We are not bothered by the nonuniqueness of the predictive distribution. Just as with 
posterior predictive checks, different distributions correspond to different potential uses of a 
posterior inference. Given some particular data, a model might predict new data accurately 
in some scenarios but not in others. 


7.2 Information criteria and cross-validation 


[1] For historical reasons, measures of predictive accuracy are referred to as information 
criteria and are typically defined based on the deviance (the log predictive density of the 
data given a point estimate of the fitted model, multiplied by -2; that is, $-2\log p(y|\hat\theta)$). 

A point estimate $\hat\theta$ and posterior distribution $p_{\rm post}(\theta)$ are fit to the data y, and out-of-sample 
predictions will typically be less accurate than implied by the within-sample 
predictive accuracy. To put it another way, the accuracy of a fitted model’s predictions of 
future data will generally be lower, in expectation, than the accuracy of the same model’s 
predictions for observed data, even if the family of models being fit happens to include 
the true data-generating process, and even if the parameters in the model happen to be 
sampled exactly from the specified prior distribution. 

[1] P.S. Instead of this section, we recommend reading Vehtari, Gelman, and Gabry (2017). 

We are interested in prediction accuracy for two reasons: first, to measure the per- 
formance of a model that we are using; second, to compare models. Our goal in model 
comparison is not necessarily to pick the model with lowest estimated prediction error or 
even to average over candidate models—as discussed in Section 7.5, we prefer continuous 
model expansion to discrete model choice or averaging—but at least to put different models 
on a common scale. Even models with completely different parameterizations can be used 
to predict the same measurements. 

When different models have the same number of parameters estimated in the same 
way, one might simply compare their best-fit log predictive densities directly, but when 
comparing models of differing size or differing effective size (for example, comparing logistic 
regressions fit using uniform, spline, or Gaussian process priors), it is important to make 
some adjustment for the natural ability of a larger model to fit data better, even if only by 
chance. 


Estimating out-of-sample predictive accuracy using available data 


Several methods are available to estimate the expected predictive accuracy without waiting 
for out-of-sample data. We cannot compute formulas such as (7.1) directly because we 
do not know the true distribution, f. Instead we can consider various approximations. 
We know of no approximation that works in general, but predictive accuracy is important 
enough that it is still worth trying. We list several reasonable-seeming approximations here. 
Each of these methods has flaws, which tells us that any predictive accuracy measure that 
we compute will be only approximate. 


e Within-sample predictive accuracy. A naive estimate of the expected log predictive den- 
sity for new data is the log predictive density for existing data. As discussed above, we 
would like to work with the Bayesian pointwise formula, that is, lppd as computed using 
the simulation (7.5). This summary is quick and easy to understand but is in general an 
overestimate of (7.2) because it is evaluated on the data from which the model was fit. 


e Adjusted within-sample predictive accuracy. Given that lppd is a biased estimate of 
elppd, the next logical step is to correct that bias. Formulas such as AIC, DIC, and 
WAIC (all discussed below) give approximately unbiased estimates of elppd by starting 
with something like lppd and then subtracting a correction for the number of parameters, 
or the effective number of parameters, being fit. These adjustments can give reasonable 
answers in many cases but have the general problem of being correct at best only in 
expectation, not necessarily in any given case. 


e Cross-validation. One can attempt to capture out-of-sample prediction error by fitting 
the model to training data and then evaluating this predictive accuracy on a holdout set. 
Cross-validation avoids the problem of overfitting but remains tied to the data at hand 
and thus can be correct at best only in expectation. In addition, cross-validation can be 
computationally expensive: to get a stable estimate typically requires many data parti- 
tions and fits. At the extreme, leave-one-out cross-validation (LOO-CV) requires n fits 
except when some computational shortcut can be used to approximate the computations. 
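
To make the contrast between within-sample accuracy and leave-one-out cross-validation concrete, here is a minimal sketch for a deliberately simple model chosen so that every refit is available in closed form: $y_i \sim \mathrm{N}(\theta, \sigma^2)$ with known $\sigma$ and a flat prior on $\theta$. This model and the function names are assumptions made for illustration; they are not the regression example of this chapter. With $m$ observations the posterior predictive distribution for a new point is $\mathrm{N}(\bar{y}, \sigma^2(1 + 1/m))$.

```python
import numpy as np

def normal_logpdf(y, mean, var):
    """Log density of N(mean, var) evaluated at y."""
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (y - mean) ** 2 / var

def lppd_within_sample(y, sigma):
    """lppd evaluated on the same n points used to fit the model."""
    y = np.asarray(y, dtype=float)
    n = y.size
    pred_var = sigma ** 2 * (1 + 1 / n)
    return float(np.sum(normal_logpdf(y, y.mean(), pred_var)))

def lppd_loo_cv(y, sigma):
    """Leave-one-out version: n separate fits, each predicting its held-out point."""
    y = np.asarray(y, dtype=float)
    n = y.size
    total = 0.0
    for i in range(n):
        y_train = np.delete(y, i)                  # refit without point i
        pred_var = sigma ** 2 * (1 + 1 / (n - 1))
        total += normal_logpdf(y[i], y_train.mean(), pred_var)
    return float(total)
```

In this particular model the leave-one-out value is always below the within-sample value, which illustrates the overestimation described in the first bullet; the loop over `n` refits is the computational cost mentioned in the last one.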


Log predictive density asymptotically, or for normal linear models 


Before introducing information criteria it is useful to discuss some asymptotic results. 
Under conditions specified in Chapter 4, the posterior distribution, $p(\theta|y)$, approaches a 
normal distribution in the limit of increasing sample size. In this asymptotic limit, the 
[Figure 7.2 (graph): plot of a distribution; the horizontal axis is the log predictive density, $\log p(y|\theta)$, with tick marks at -50, -45, and -40.] 

Figure 7.2 Posterior distribution of the log predictive density $\log p(y|\theta)$ for the election forecasting 
example. The variation comes from posterior uncertainty in $\theta$. The maximum value of the distribution, 
-40.3, is the log predictive density when $\theta$ is at the maximum likelihood estimate. The 
mean of the distribution is -42.0, and the difference between the mean and the maximum is 1.7, 
which is close to the value of 3/2 that would be predicted from asymptotic theory, given that we are 
estimating 3 parameters (two coefficients and a residual variance). 


posterior is dominated by the likelihood—the prior contributes only one factor, while the 
likelihood contributes n factors, one for each data point—and so the likelihood function 
also approaches the same normal distribution. 

As sample size $n \to \infty$, we can label the limiting posterior distribution as $\theta|y \to 
\mathrm{N}(\theta_0, V_0/n)$. In this limit the log predictive density is 


$$\log p(y|\theta) = c(y) - \frac{1}{2}\left(k\log(2\pi) + \log|V_0/n| + (\theta - \theta_0)^T(V_0/n)^{-1}(\theta - \theta_0)\right),$$


where $c(y)$ is a constant that only depends on the data y and the model class but not on 
the parameters $\theta$. 

The limiting multivariate normal distribution for $\theta$ induces a posterior distribution for 
the log predictive density that ends up being a constant ($c(y) - \frac{1}{2}(k\log(2\pi) + \log|V_0/n|)$) 
minus $\frac{1}{2}$ times a $\chi^2_k$ random variable, where k is the dimension of $\theta$, that is, the number of 
parameters in the model. The maximum of this distribution of the log predictive density 
is attained when $\theta$ equals the maximum likelihood estimate (of course), and its posterior 
mean is at a value $\frac{k}{2}$ lower. For actual posterior distributions, this asymptotic result is only 
an approximation, but it will be useful as a benchmark for interpreting the log predictive 
density as a measure of fit. 
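
The benchmark of $k/2$ can be checked by simulation in the normal limit. The sketch below is only an illustration under stated assumptions: the covariance matrix standing in for $V_0/n$ is an arbitrary positive definite matrix generated inside the function, and the function name is hypothetical. It draws $\theta$ from the limiting normal distribution and averages half the quadratic form, which is the gap between the maximum log predictive density and its value at each draw.

```python
import numpy as np

def mean_gap_in_normal_limit(k, S, rng):
    """Average of (max log density - log density at theta) over S draws of theta."""
    A = rng.standard_normal((k, k))
    V = A @ A.T + k * np.eye(k)          # arbitrary positive definite stand-in for V_0/n
    theta0 = np.zeros(k)
    draws = rng.multivariate_normal(theta0, V, size=S)
    d = draws - theta0
    # quadratic form (theta - theta0)^T V^{-1} (theta - theta0), distributed chi^2_k
    q = np.einsum("si,ij,sj->s", d, np.linalg.inv(V), d)
    return float(0.5 * q.mean())
```

With `k=3` the result should be close to 1.5, since a $\chi^2_k$ random variable has mean $k$.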

With singular models (such as mixture models and overparameterized complex models 
more generally discussed in Part V of this book), a single data model can arise from more 
than one possible parameter vector, the Fisher information matrix is not positive definite, 
plug-in estimates are not representative of the posterior distribution, and the deviance does 
not converge to a $\chi^2$ distribution. 


Example. Fit of the election forecasting model: Bayesian inference 

The log predictive probability density of the data is $\sum_{i=1}^n \log(\mathrm{N}(y_i|a + bx_i, \sigma^2))$, with 
an uncertainty induced by the posterior distribution, $p_{\rm post}(a, b, \sigma^2)$. 

Posterior distribution of the observed log predictive density, $p(y|\theta)$. The posterior 
distribution $p_{\rm post}(\theta) = p(a, b, \sigma^2|y)$ is normal-inverse-$\chi^2$. To get a sense of uncertainty 
in the log predictive density $p(y_i|\theta)$, we compute it for each of S = 10,000 posterior 
simulation draws of $\theta$. Figure 7.2 shows the resulting distribution, which looks roughly 
like a $\chi^2_3$ (no surprise since three parameters are being estimated, two coefficients 
and a variance, and the sample size of 15 is large enough that we would expect the 
asymptotic normal approximation to the posterior distribution to be pretty good), 
scaled by a factor of $-\frac{1}{2}$ and shifted so that its upper limit corresponds to the maximum 
likelihood estimate (with log predictive density of -40.3, as noted earlier). The mean 
of the posterior distribution of the log predictive density is -42.0, and the difference 
between the mean and the maximum is 1.7, which is close to the value of 3/2 that would 
be predicted from asymptotic theory, given that 3 parameters are being estimated. 


Akaike information criterion (AIC) 


In much of the statistical literature on predictive accuracy, inference for $\theta$ is summarized 
not by a posterior distribution $p_{\mathrm{post}}$ but by a point estimate $\hat\theta$, typically the maximum 
likelihood estimate. Out-of-sample predictive accuracy is then defined not by the expected 
log posterior predictive density (7.1) but by $\mathrm{elpd}_{\hat\theta} = E_f(\log p(\tilde y \mid \hat\theta(y)))$ defined in (7.3), where 
both $y$ and $\tilde y$ are random. There is no direct way to calculate (7.3); instead the standard 
approach is to use the log posterior density of the observed data $y$ given a point estimate $\hat\theta$ 
and correct for bias due to overfitting. 

Let k be the number of parameters estimated in the model. The simplest bias correction 
is based on the asymptotic normal posterior distribution. In this limit (or in the special case 
of a normal linear model with known variance and uniform prior distribution), subtracting 
k from the log predictive density given the maximum likelihood estimate is a correction for 
how much the fitting of k parameters will increase predictive accuracy, by chance alone: 


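$$\widehat{\mathrm{elpd}}_{\mathrm{AIC}} = \log p(y \mid \hat\theta_{\mathrm{mle}}) - k. \tag{7.6}$$


AIC is defined as the above quantity multiplied by −2; thus $\mathrm{AIC} = -2 \log p(y \mid \hat\theta_{\mathrm{mle}}) + 2k$. 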
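
As a minimal sketch of formula (7.6), the code below fits a simple linear regression by maximum likelihood and computes AIC with k = 3. The function names and the simulated data are hypothetical and chosen only for illustration; they are not the election data.

```python
import math
import random


def fit_simple_regression_mle(x, y):
    """Maximum likelihood estimates (a, b, sigma) for y_i ~ N(a + b*x_i, sigma^2)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    sigma = math.sqrt(rss / n)  # the MLE divides by n, not n - 2
    return a, b, sigma


def normal_log_density(value, mean, sd):
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((value - mean) / sd) ** 2


def aic(x, y):
    """Return (AIC, elpd_AIC, log-likelihood at the MLE)."""
    a, b, sigma = fit_simple_regression_mle(x, y)
    log_lik = sum(normal_log_density(yi, a + b * xi, sigma) for xi, yi in zip(x, y))
    k = 3  # two coefficients and one variance parameter
    elpd_aic = log_lik - k
    return -2.0 * elpd_aic, elpd_aic, log_lik


# Simulated data, for illustration only.
rng = random.Random(0)
x = [float(i) for i in range(15)]
y = [45.0 + 3.0 * xi + rng.gauss(0.0, 4.0) for xi in x]
aic_value, elpd_aic, log_lik = aic(x, y)
print(f"log p(y | mle) = {log_lik:.1f}, elpd_AIC = {elpd_aic:.1f}, AIC = {aic_value:.1f}")
```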

It makes sense to adjust the deviance for fitted parameters, but once we go beyond 
linear models with flat priors, we cannot simply add k. Informative prior distributions and 
hierarchical structures tend to reduce the amount of overfitting, compared to what would 
happen under simple least squares or maximum likelihood estimation. 

For models with informative priors or hierarchical structure, the effective number of parameters 
strongly depends on the variance of the group-level parameters. We shall illustrate 
in Section 7.3 with the example of educational testing experiments in 8 schools. Under the 
hierarchical model in that example, we would expect the effective number of parameters to 
be somewhere between 8 (one for each school) and 1 (for the average of the school effects). 


Deviance information criterion (DIC) and effective number of parameters 


DIC is a somewhat Bayesian version of AIC that takes formula (7.6) and makes two changes, 
replacing the maximum likelihood estimate $\hat\theta$ with the posterior mean $\hat\theta_{\mathrm{Bayes}} = E(\theta \mid y)$ and 
replacing k with a data-based bias correction. The new measure of predictive accuracy is, 

$$\widehat{\mathrm{elpd}}_{\mathrm{DIC}} = \log p(y \mid \hat\theta_{\mathrm{Bayes}}) - p_{\mathrm{DIC}}, \tag{7.7}$$

where $p_{\mathrm{DIC}}$ is the effective number of parameters, defined as, 

$$p_{\mathrm{DIC}} = 2\left(\log p(y \mid \hat\theta_{\mathrm{Bayes}}) - E_{\mathrm{post}}(\log p(y \mid \theta))\right), \tag{7.8}$$

where the expectation in the second term is an average of $\theta$ over its posterior distribution. 
Expression (7.8) is calculated using simulations $\theta^s$, $s = 1, \ldots, S$ as, 

$$\text{computed } p_{\mathrm{DIC}} = 2\left(\log p(y \mid \hat\theta_{\mathrm{Bayes}}) - \frac{1}{S}\sum_{s=1}^{S} \log p(y \mid \theta^s)\right). \tag{7.9}$$

The posterior mean of $\theta$ will produce the maximum log predictive density when it happens 
to be the same as the mode, and negative $p_{\mathrm{DIC}}$ can be produced if the posterior mean is far 
from the mode. 

An alternative version of DIC uses a slightly different effective number of parameters: 


$$p_{\mathrm{DIC\,alt}} = 2\,\mathrm{var}_{\mathrm{post}}(\log p(y \mid \theta)). \tag{7.10}$$


Both $p_{\mathrm{DIC}}$ and $p_{\mathrm{DIC\,alt}}$ give the correct answer in the limit of fixed model and large n 
and can be derived from the asymptotic $\chi^2$ distribution (shifted and scaled by a factor of 
$-1/2$) of the log predictive density. For linear models with uniform prior distributions, both 
these measures of the effective number of parameters reduce to k. Of these two measures, $p_{\mathrm{DIC}}$ is more 
numerically stable but $p_{\mathrm{DIC\,alt}}$ has the advantage of always being positive. Compared to 
previous proposals for estimating the effective number of parameters, the easier and more stable 
Monte Carlo approximation of DIC quickly made it popular. 

The actual quantity called DIC is defined in terms of the deviance rather than the log 
predictive density; thus, 


$$\mathrm{DIC} = -2 \log p(y \mid \hat\theta_{\mathrm{Bayes}}) + 2 p_{\mathrm{DIC}}.$$
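
The computed versions of (7.9) and (7.10) can be sketched as follows. The helper `dic_summary` and the toy model (a normal mean with known unit variance and a flat prior, so that posterior draws are available directly) are assumptions made for illustration, not part of the examples in this chapter.

```python
import math
import random
import statistics


def dic_summary(log_lik, draws):
    """Return (p_dic, p_dic_alt, dic) from posterior draws.

    log_lik: function mapping a parameter vector (list of floats) to log p(y | theta).
    draws:   list of posterior draws, each a list of floats.
    """
    n_params = len(draws[0])
    theta_bayes = [statistics.fmean(d[j] for d in draws) for j in range(n_params)]
    ll_draws = [log_lik(d) for d in draws]
    ll_at_mean = log_lik(theta_bayes)
    p_dic = 2.0 * (ll_at_mean - statistics.fmean(ll_draws))  # computed p_DIC, (7.9)
    p_dic_alt = 2.0 * statistics.variance(ll_draws)          # (7.10), sample variance
    dic = -2.0 * ll_at_mean + 2.0 * p_dic
    return p_dic, p_dic_alt, dic


# Toy model: y_i ~ N(mu, 1) with a flat prior, so mu | y ~ N(ybar, 1/n).
rng = random.Random(2)
y = [rng.gauss(1.0, 1.0) for _ in range(30)]
ybar = statistics.fmean(y)


def log_lik(theta):
    mu = theta[0]
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (obs - mu) ** 2 for obs in y)


draws = [[rng.gauss(ybar, 1.0 / math.sqrt(len(y)))] for _ in range(20000)]
p_dic, p_dic_alt, dic = dic_summary(log_lik, draws)
print(f"p_DIC = {p_dic:.2f}, p_DICalt = {p_dic_alt:.2f}, DIC = {dic:.1f}")
```

With one freely estimated parameter and a flat prior, both measures should come out close to 1, up to simulation error.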


Watanabe-Akaike or widely applicable information criterion (WAIC) 


WAIC is a more fully Bayesian approach for estimating the out-of-sample expectation (7.2), 
starting with the computed log pointwise posterior predictive density (7.5) and then adding 
a correction for effective number of parameters to adjust for overfitting. 
Two adjustments have been proposed. Both are based on pointwise calculations and 
can be viewed as approximations to cross-validation, based on derivations not shown here. 
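The first approach is a difference, similar to that used to construct $p_{\mathrm{DIC}}$: 

$$p_{\mathrm{WAIC}1} = 2\sum_{i=1}^{n}\left(\log(E_{\mathrm{post}}\, p(y_i \mid \theta)) - E_{\mathrm{post}}(\log p(y_i \mid \theta))\right),$$

computed by replacing the expectations by averages over the S posterior draws $\theta^s$: 

$$\text{computed } p_{\mathrm{WAIC}1} = 2\sum_{i=1}^{n}\left(\log\left(\frac{1}{S}\sum_{s=1}^{S} p(y_i \mid \theta^s)\right) - \frac{1}{S}\sum_{s=1}^{S} \log p(y_i \mid \theta^s)\right).$$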


The other measure uses the variance of individual terms in the log predictive density 
summed over the n data points: 


$$p_{\mathrm{WAIC}2} = \sum_{i=1}^{n} \mathrm{var}_{\mathrm{post}}(\log p(y_i \mid \theta)). \tag{7.11}$$

This expression looks similar to (7.10), the formula for $p_{\mathrm{DIC\,alt}}$ (although without the factor 
of 2), but is more stable because it computes the variance separately for each data point 
and then sums; the summing yields stability. 

To calculate (7.11), we compute the posterior variance of the log predictive density for 
each data point $y_i$, that is, $V_{s=1}^{S} \log p(y_i \mid \theta^s)$, where $V_{s=1}^{S}$ represents the sample variance, 
$V_{s=1}^{S} a_s = \frac{1}{S-1}\sum_{s=1}^{S}(a_s - \bar a)^2$. Summing over all the data points $y_i$ gives the effective 
number of parameters: 

$$\text{computed } p_{\mathrm{WAIC}2} = \sum_{i=1}^{n} V_{s=1}^{S}(\log p(y_i \mid \theta^s)). \tag{7.12}$$

We can then use either $p_{\mathrm{WAIC}1}$ or $p_{\mathrm{WAIC}2}$ as a bias correction: 

$$\widehat{\mathrm{elppd}}_{\mathrm{WAIC}} = \mathrm{lppd} - p_{\mathrm{WAIC}}. \tag{7.13}$$

In the present discussion, we evaluate both $p_{\mathrm{WAIC}1}$ and $p_{\mathrm{WAIC}2}$. For practical use, 
we recommend $p_{\mathrm{WAIC}2}$ because its series expansion has closer resemblance to the series 
expansion for LOO-CV and also in practice seems to give results closer to LOO-CV. 

As with AIC and DIC, we define WAIC as −2 times the expression (7.13) so as to be 
on the deviance scale: 

$$\mathrm{WAIC} = -2\,\mathrm{lppd} + 2 p_{\mathrm{WAIC}2},$$

with lppd computed as in (7.5) and $p_{\mathrm{WAIC}2}$ computed in (7.12). 
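
The pointwise calculations can be sketched from an S × n matrix of log densities, one row per posterior draw and one column per data point. The function `waic_summary`, the matrix layout, and the toy normal-mean model with known unit variance and flat prior are all assumptions made for this illustration.

```python
import math
import random
import statistics


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))


def waic_summary(log_lik_matrix):
    """log_lik_matrix[s][i] = log p(y_i | theta^s) for draw s and data point i."""
    n_points = len(log_lik_matrix[0])
    lppd = p_waic1 = p_waic2 = 0.0
    for i in range(n_points):
        column = [row[i] for row in log_lik_matrix]
        lppd_i = log_mean_exp(column)
        lppd += lppd_i
        p_waic1 += 2.0 * (lppd_i - statistics.fmean(column))
        p_waic2 += statistics.variance(column)  # sample variance over draws, as in (7.12)
    waic = -2.0 * lppd + 2.0 * p_waic2
    return {"lppd": lppd, "p_waic1": p_waic1, "p_waic2": p_waic2, "waic": waic}


# Toy model: y_i ~ N(theta, 1) with a flat prior, so theta | y ~ N(ybar, 1/n).
rng = random.Random(4)
y = [rng.gauss(0.0, 1.0) for _ in range(100)]
ybar = statistics.fmean(y)
draws = [rng.gauss(ybar, 1.0 / math.sqrt(len(y))) for _ in range(2000)]
log_lik_matrix = [
    [-0.5 * math.log(2 * math.pi) - 0.5 * (yi - theta) ** 2 for yi in y]
    for theta in draws
]
result = waic_summary(log_lik_matrix)
print({name: round(value, 2) for name, value in result.items()})
```

Working with log densities and a log-mean-exp step avoids underflow when individual densities are very small; that is a design choice of this sketch rather than something required by the formulas.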

In Watanabe’s original definition, WAIC is the negative of the average log pointwise 
predictive density (assuming the prediction of a single new data point) and thus is divided 
by n and does not have the factor 2; here we scale it so as to be comparable with AIC, DIC, 
and other measures of deviance. 

For a normal linear model with large sample size, known variance, and uniform prior 
distribution on the coefficients, $p_{\mathrm{WAIC}1}$ and $p_{\mathrm{WAIC}2}$ are approximately equal to the number 
of parameters in the model. More generally, the adjustment can be thought of as an 
approximation to the number of ‘unconstrained’ parameters in the model, where a parameter 
counts as 1 if it is estimated with no constraints or prior information, 0 if it is fully 
constrained or if all the information about the parameter comes from the prior distribution, or 
an intermediate value if both the data and prior distributions are informative. 

Compared to AIC and DIC, WAIC has the desirable property of averaging over the 
posterior distribution rather than conditioning on a point estimate. This is especially relevant 
in a predictive context, as WAIC is evaluating the predictions that are actually being 
used for new data in a Bayesian context. AIC and DIC estimate the performance of the 
plug-in predictive density, but Bayesian users of these measures would still use the posterior 
predictive density for predictions. 

Other information criteria are based on Fisher’s asymptotic theory assuming a regular 
model for which the likelihood or the posterior converges to a single point, and where 
maximum likelihood and other plug-in estimates are asymptotically equivalent. WAIC 
works also with singular models and thus is particularly helpful for models with hierarchical 
and mixture structures in which the number of parameters increases with sample size and 
where point estimates often do not make sense. 

For all these reasons, we find WAIC more appealing than AIC and DIC. A cost of using 
WAIC is that it relies on a partition of the data into n pieces, which is not so easy to do in 
some structured-data settings such as time series, spatial, and network data. AIC and DIC 
do not make this partition explicitly, but derivations of AIC and DIC assume that residuals 
are independent given the point estimate $\hat\theta$: conditioning on a point estimate $\hat\theta$ eliminates 
posterior dependence at the cost of not fully capturing posterior uncertainty. 


Effective number of parameters as a random variable 


It makes sense that $p_{\mathrm{DIC}}$ and $p_{\mathrm{WAIC}}$ depend not just on the structure of the model but 
on the particular data that happen to be observed. For a simple example, consider the 
model $y_1, \ldots, y_n \sim \mathrm{N}(\theta, 1)$, with n large and $\theta \sim \mathrm{U}(0, \infty)$. That is, $\theta$ is constrained to 
be positive but otherwise has a noninformative uniform prior distribution. How many 
parameters are being estimated in this model? If the measurement y is close to zero, then 
the effective number of parameters p is approximately 1/2, since roughly half the information 
in the posterior distribution is coming from the data and half from the prior constraint of 
positivity. However, if y is positive and large, then the constraint is essentially irrelevant, 
and the effective number of parameters is approximately 1. This example illustrates that, 
even with a fixed model and fixed true parameters, it can make sense for the effective 
number of parameters to depend on data. 


‘Bayesian’ information criterion (BIC) 


There is also something called the Bayesian information criterion (a misleading name, we 
believe) that adjusts for the number of fitted parameters with a penalty that increases with 
the sample size, n. The formula is $\mathrm{BIC} = -2 \log p(y \mid \hat\theta) + k \log n$, which for large datasets 
gives a larger penalty per parameter compared to AIC and thus favors simpler models. 
BIC differs from the other information criteria considered here in being motivated not by 
an estimation of predictive fit but by the goal of approximating the marginal probability 
density of the data, p(y), under the model, which can be used to estimate relative posterior 
probabilities in a setting of discrete model comparison. For reasons described in Section 7.4, 
we do not typically find it useful to think about the posterior probabilities of models, but we 
recognize that others find BIC and similar measures helpful for both theoretical and applied 
reasons. At this point, we merely point out that BIC has a different goal than the other 
measures we have discussed. It is completely possible for a complicated model to predict 
well and have a low AIC, DIC, and WAIC, but, because of the penalty function, to have a 
relatively high (that is, poor) BIC. Given that BIC is not intended to predict out-of-sample 
model performance but rather is designed for other purposes, we do not consider it further 
here. 
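
To make the difference in penalties concrete, the sketch below compares the per-parameter penalty of AIC (always 2) with that of BIC (log n). The function names are hypothetical; the formulas are the ones given in the text.

```python
import math


def aic(log_lik_at_estimate, k):
    return -2.0 * log_lik_at_estimate + 2.0 * k


def bic(log_lik_at_estimate, k, n):
    return -2.0 * log_lik_at_estimate + k * math.log(n)


def penalty_per_parameter(n):
    """Return the (AIC, BIC) penalties charged for each fitted parameter."""
    return 2.0, math.log(n)


for n in (5, 8, 100, 10000):
    aic_penalty, bic_penalty = penalty_per_parameter(n)
    print(f"n = {n:5d}: AIC penalty {aic_penalty:.2f}, BIC penalty {bic_penalty:.2f}")
```

The BIC penalty exceeds the AIC penalty once log n > 2, that is, for n of 8 or more, and it keeps growing with the sample size.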


Leave-one-out cross-validation 


In Bayesian cross-validation, the data are repeatedly partitioned into a training set $y_{\mathrm{train}}$ 
and a holdout set $y_{\mathrm{holdout}}$, and then the model is fit to $y_{\mathrm{train}}$ (thus yielding a posterior 
distribution $p_{\mathrm{train}}(\theta) = p(\theta \mid y_{\mathrm{train}})$), with this fit evaluated using an estimate of the log 
predictive density of the holdout data, $\log p_{\mathrm{train}}(y_{\mathrm{holdout}}) = \log \int p_{\mathrm{pred}}(y_{\mathrm{holdout}} \mid \theta)\, p_{\mathrm{train}}(\theta)\, d\theta$. 
Assuming the posterior distribution $p(\theta \mid y_{\mathrm{train}})$ is summarized by S simulation draws $\theta^s$, we 
calculate the log predictive density as $\log\left(\frac{1}{S}\sum_{s=1}^{S} p(y_{\mathrm{holdout}} \mid \theta^s)\right)$. 


For simplicity, we restrict our attention here to leave-one-out cross-validation (LOO- 
CV), the special case with n partitions in which each holdout set represents a single data 
point. Performing the analysis for each of the n data points (or perhaps a random subset for 
efficient computation if n is large) yields n different inferences $p_{\mathrm{post}(-i)}$, each summarized 
by S posterior simulations, $\theta^{is}$. 

The Bayesian LOO-CV estimate of out-of-sample predictive fit is 


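$$\mathrm{lppd}_{\mathrm{loo-cv}} = \sum_{i=1}^{n} \log p_{\mathrm{post}(-i)}(y_i), \text{ calculated as } \sum_{i=1}^{n} \log\left(\frac{1}{S}\sum_{s=1}^{S} p(y_i \mid \theta^{is})\right). \tag{7.14}$$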


Each prediction is conditioned on n — 1 data points, which causes underestimation of the 
predictive fit. For large n the difference is negligible, but for small n (or when using k-fold 
cross-validation) we can use a first order bias correction b by estimating how much better 
predictions would be obtained if conditioning on n data points: 


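$$b = \mathrm{lppd} - \overline{\mathrm{lppd}}_{-i},$$

where 

$$\overline{\mathrm{lppd}}_{-i} = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} \log p_{\mathrm{post}(-i)}(y_j), \text{ calculated as } \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} \log\left(\frac{1}{S}\sum_{s=1}^{S} p(y_j \mid \theta^{is})\right).$$

The bias-corrected Bayesian LOO-CV is then 

$$\mathrm{lppd}_{\mathrm{cloo-cv}} = \mathrm{lppd}_{\mathrm{loo-cv}} + b.$$

The bias correction b is rarely used as it is usually small, but we include it for completeness. 

To make comparisons to other methods, we compute an estimate of the effective number 
of parameters as 

$$p_{\mathrm{loo-cv}} = \mathrm{lppd} - \mathrm{lppd}_{\mathrm{loo-cv}} \tag{7.15}$$

or, using bias-corrected LOO-CV, 

$$p_{\mathrm{cloo-cv}} = \mathrm{lppd} - \mathrm{lppd}_{\mathrm{cloo-cv}} = \overline{\mathrm{lppd}}_{-i} - \mathrm{lppd}_{\mathrm{loo-cv}}.$$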
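
A brute-force version of (7.14) and (7.15) can be sketched for a toy model in which each leave-one-out posterior is available in closed form. The model (normal data with known unit variance and a flat prior on the mean), the function names, and the simulated data are assumptions for illustration; the bias correction b is omitted.

```python
import math
import random
import statistics


def log_normal_pdf(value, mean, sd):
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((value - mean) / sd) ** 2


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))


def posterior_draws(data, n_draws, rng):
    """Draws of theta for y_i ~ N(theta, 1) with a flat prior: N(mean(data), 1/len(data))."""
    center = statistics.fmean(data)
    sd = 1.0 / math.sqrt(len(data))
    return [rng.gauss(center, sd) for _ in range(n_draws)]


def lppd_full(y, n_draws, rng):
    """Log pointwise predictive density using the posterior given all the data."""
    draws = posterior_draws(y, n_draws, rng)
    return sum(log_mean_exp([log_normal_pdf(yi, t, 1.0) for t in draws]) for yi in y)


def lppd_loo_cv(y, n_draws, rng):
    """Equation (7.14): refit without each point, then score that point."""
    total = 0.0
    for i, held_out in enumerate(y):
        training = y[:i] + y[i + 1:]
        draws = posterior_draws(training, n_draws, rng)
        total += log_mean_exp([log_normal_pdf(held_out, t, 1.0) for t in draws])
    return total


rng = random.Random(3)
y = [rng.gauss(0.5, 1.0) for _ in range(25)]
lppd = lppd_full(y, 4000, rng)
lppd_loo = lppd_loo_cv(y, 4000, rng)
p_loo_cv = lppd - lppd_loo  # equation (7.15)
print(f"lppd = {lppd:.1f}, lppd_loo-cv = {lppd_loo:.1f}, p_loo-cv = {p_loo_cv:.2f}")
```

In a realistic model the call to `posterior_draws` would be replaced by a full refit for each held-out point, which is where the computational cost of the brute-force approach comes from.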


Cross-validation is like WAIC in that it requires data to be divided into disjoint, ideally 
conditionally independent, pieces. This represents a limitation of the approach when applied 
to structured models. In addition, cross-validation can be computationally expensive 
except in settings where shortcuts are available to approximate the distributions $p_{\mathrm{post}(-i)}$ 
without having to re-fit the model each time. For example, LOO-CV can be efficiently 
approximated using the draws from the full posterior distribution and Pareto-smoothed 
importance sampling without the need to re-fit the model. In this chapter we use the brute 
force approach for clarity. If no shortcuts are available, a common approach is to use k-fold 
cross-validation, where the data are partitioned into k sets. With a moderate value of k, for example 
10, computation time is reasonable in most applications. 

Under some conditions, different information criteria have been shown to be asymptotically 
equal to leave-one-out cross-validation (in the limit $n \to \infty$, the bias correction can 
be ignored in the proofs). AIC has been shown to be asymptotically equal to LOO-CV as 
computed using the maximum likelihood estimate. DIC is a variation of the regularized 
information criteria which have been shown to be asymptotically equal to LOO-CV using 
plug-in predictive densities. WAIC has been shown to be asymptotically equal to Bayesian 
LOO-CV. 

For finite n there is a difference, as LOO-CV conditions the posterior predictive densities 
on n — 1 data points. These differences can be apparent for small n or in hierarchical 
models, as we discuss in our examples. Other differences arise in regression or hierarchical 
models. LOO-CV assumes the prediction task $p(\tilde y_i \mid \tilde x_i, y_{-i}, x_{-i})$ while WAIC estimates 
$p(\tilde y_i \mid y, x) = p(\tilde y_i \mid y_i, x_i, y_{-i}, x_{-i})$, so WAIC is making predictions only at x-locations already 
observed (or in subgroups indexed by $x_i$). This can make a noticeable difference in flexible 
regression models such as Gaussian processes or hierarchical models where prediction given 
$x_i$ may depend only weakly on all other data points $(y_{-i}, x_{-i})$. We illustrate with a simple 
hierarchical model in Section 7.3. 

The cross-validation estimates are similar to the non-Bayesian resampling method known 
as jackknife. Even though we are working with the posterior distribution, our goal is to 
estimate an expectation averaging over $y^{\mathrm{rep}}$ in its true, unknown distribution, $f$; thus, we 
are studying the frequency properties of a Bayesian procedure. 


Comparing different estimates of out-of-sample prediction accuracy 


All the different measures discussed above are based on adjusting the log predictive density 
of the observed data by subtracting an approximate bias correction. The measures differ 
both in their baseline measures of fit and in their adjustments. 

AIC starts with the log predictive density of the data conditional on the maximum 
likelihood estimate $\hat\theta$, DIC conditions on the posterior mean $E(\theta \mid y)$, and WAIC starts with 
the log predictive density, averaging over $p_{\mathrm{post}}(\theta) = p(\theta \mid y)$. Of these three approaches, only 
WAIC is fully Bayesian and so it is our preference when using a bias correction formula. 
Cross-validation can be applied to any measure of fit; we use the log pointwise posterior 
predictive density as with WAIC. 


Example. Predictive error in the election forecasting model 
We illustrate the different estimates of out-of-sample log predictive density using the 
regression model of 15 elections introduced on page 165. 
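AIC. Fit to all 15 data points, the maximum likelihood estimate of the vector $(\hat a, \hat b, \hat\sigma)$ 
is (45.9, 3.2, 3.6). Since 3 parameters are estimated, the value of $\widehat{\mathrm{elpd}}_{\mathrm{AIC}}$ is 

$$\sum_{i=1}^{15} \log \mathrm{N}(y_i \mid 45.9 + 3.2 x_i, 3.6^2) - 3 = -43.3,$$

and $\mathrm{AIC} = -2\,\widehat{\mathrm{elpd}}_{\mathrm{AIC}} = 86.6$. 
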
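DIC. The relevant formula is $p_{\mathrm{DIC}} = 2\left(\log p(y \mid E_{\mathrm{post}}(\theta)) - E_{\mathrm{post}}(\log p(y \mid \theta))\right)$. 
The second of these terms is invariant to reparameterization; we calculate it as 

$$E_{\mathrm{post}}(\log p(y \mid \theta)) = \frac{1}{S}\sum_{s=1}^{S}\sum_{i=1}^{15} \log \mathrm{N}(y_i \mid a^s + b^s x_i, (\sigma^s)^2) = -42.0,$$

based on a large number S of simulation draws. 

The first term is not invariant. With respect to the prior $p(a, b, \log\sigma) \propto 1$, the 
posterior means of a and b are 45.9 and 3.2, the same as the maximum likelihood 
estimate. The posterior means of $\sigma$, $\sigma^2$, and $\log\sigma$ are $E(\sigma \mid y) = 4.1$, $E(\sigma^2 \mid y) = 17.2$, 
and $E(\log\sigma \mid y) = 1.4$. Parameterizing using $\sigma$, we get 

$$\log p(y \mid E_{\mathrm{post}}(\theta)) = \sum_{i=1}^{15} \log \mathrm{N}(y_i \mid E(a \mid y) + E(b \mid y)\, x_i, (E(\sigma \mid y))^2) = -40.5,$$

which gives $p_{\mathrm{DIC}} = 2(-40.5 - (-42.0)) = 3.0$, $\widehat{\mathrm{elpd}}_{\mathrm{DIC}} = \log p(y \mid E_{\mathrm{post}}(\theta)) - p_{\mathrm{DIC}} = 
-40.5 - 3.0 = -43.5$, and $\mathrm{DIC} = -2\,\widehat{\mathrm{elpd}}_{\mathrm{DIC}} = 87.0$. 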

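WAIC. The log pointwise predictive probability of the observed data under the fitted 
model is 

$$\mathrm{lppd} = \sum_{i=1}^{15} \log\left(\frac{1}{S}\sum_{s=1}^{S} \mathrm{N}(y_i \mid a^s + b^s x_i, (\sigma^s)^2)\right) = -40.9.$$

The effective number of parameters can be calculated as 

$$p_{\mathrm{WAIC}1} = 2\left(\mathrm{lppd} - E_{\mathrm{post}}(\log p(y \mid \theta))\right) = 2(-40.9 - (-42.0)) = 2.2$$

or 

$$p_{\mathrm{WAIC}2} = \sum_{i=1}^{15} V_{s=1}^{S} \log \mathrm{N}(y_i \mid a^s + b^s x_i, (\sigma^s)^2) = 2.7.$$

Then $\widehat{\mathrm{elppd}}_{\mathrm{WAIC}1} = \mathrm{lppd} - p_{\mathrm{WAIC}1} = -40.9 - 2.2 = -43.1$, and $\widehat{\mathrm{elppd}}_{\mathrm{WAIC}2} = 
\mathrm{lppd} - p_{\mathrm{WAIC}2} = -40.9 - 2.7 = -43.6$, so WAIC is 86.2 or 87.2. 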


Leave-one-out cross-validation. We fit the model 15 times, leaving out a different 
data point each time. For each fit of the model, we sample S times from the posterior 
distribution of the parameters and compute the log predictive density. The cross- 
validated pointwise predictive accuracy is 


$$\mathrm{lppd}_{\mathrm{loo-cv}} = \sum_{l=1}^{15} \log\left(\frac{1}{S}\sum_{s=1}^{S} \mathrm{N}(y_l \mid a^{ls} + b^{ls} x_l, (\sigma^{ls})^2)\right),$$

which equals −43.8. Multiplying by −2 to be on the same scale as AIC and the others, 
we get 87.6. The effective number of parameters from cross-validation, from (7.15), is 
$p_{\mathrm{loo-cv}} = E(\mathrm{lppd}) - E(\mathrm{lppd}_{\mathrm{loo-cv}}) = -40.9 - (-43.8) = 2.9$. 

Given that this model includes two linear coefficients and a variance parameter, these 
all look like reasonable estimates of the effective number of parameters. 
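
The arithmetic of this example can be retraced in a few lines. The sketch below does not recompute anything from the election data; it starts from the intermediate values quoted in the text and applies the definitions of each criterion. The variable names are ours.

```python
# Intermediate values quoted in the election example (not recomputed here).
log_lik_mle = -40.3        # log p(y | maximum likelihood estimate)
log_lik_post_mean = -40.5  # log p(y | E_post(theta)), parameterizing with sigma
mean_log_lik = -42.0       # E_post(log p(y | theta))
lppd = -40.9
p_waic2 = 2.7
lppd_loo_cv = -43.8
k = 3

aic = -2 * (log_lik_mle - k)
p_dic = 2 * (log_lik_post_mean - mean_log_lik)
dic = -2 * (log_lik_post_mean - p_dic)
p_waic1 = 2 * (lppd - mean_log_lik)
waic1 = -2 * (lppd - p_waic1)
waic2 = -2 * (lppd - p_waic2)
loo_deviance = -2 * lppd_loo_cv
p_loo_cv = lppd - lppd_loo_cv

summary = [
    ("AIC", k, aic),
    ("DIC", p_dic, dic),
    ("WAIC (p_WAIC1)", p_waic1, waic1),
    ("WAIC (p_WAIC2)", p_waic2, waic2),
    ("LOO-CV", p_loo_cv, loo_deviance),
]
for name, p_eff, deviance in summary:
    print(f"{name:<15s} effective parameters {p_eff:4.1f}   deviance scale {deviance:5.1f}")
```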


External validation. How well will the model predict new data? The 2012 election 
gives an answer, but it is just one data point. This illustrates the difficulty with 
external validation for this sort of problem. 


7.3 Model comparison based on predictive performance [2] 


There are generally many options in setting up a model for any applied problem. Our usual 
approach is to start with a simple model that uses only some of the available information— 
for example, not using some possible predictors in a regression, fitting a normal model to 
discrete data, or ignoring evidence of unequal variances and fitting a simple equal-variance 
model. Once we have successfully fitted a simple model, we can check its fit to data (as 
discussed in Sections 6.3 and 6.4) and then expand it (as discussed in Section 7.5). 

There are two typical scenarios in which models are compared. First, when a model is 
expanded, it is natural to compare the smaller to the larger model and assess what has been 
gained by expanding the model (or, conversely, if a model is simplified, to assess what was 
lost). This generalizes into the problem of comparing a set of nested models and judging 
how much complexity is necessary to fit the data. 

In comparing nested models, the larger model typically has the advantage of making 
more sense and fitting the data better but the disadvantage of being more difficult to 
understand and compute. The key questions of model comparison are typically: (1) is the 
improvement in fit large enough to justify the additional difficulty in fitting, and (2) is the 
prior distribution on the additional parameters reasonable? 

The second scenario of model comparison is between two or more nonnested models— 
neither model generalizes the other. One might compare regressions that use different sets of 
predictors to fit the same data, for example, modeling political behavior using information 
based on past voting results or on demographics. In these settings, we are typically not 
interested in choosing one of the models—it would be better, both in substantive and 
predictive terms, to construct a larger model that includes both as special cases, including 
both sets of predictors and also potential interactions in a larger regression, possibly with an 
informative prior distribution if needed to control the estimation of all the extra parameters. 
However, it can be useful to compare the fit of the different models, to see how either set of 
predictors performs when considered alone. 

In any case, when evaluating models in this way, it is important to adjust for overfitting, 
especially when comparing models that vary greatly in their complexity. 


Example. Expected predictive accuracy of models for the eight schools 
In the example of Section 5.5, three modes of inference were proposed: 


• No pooling: Separate estimates for each of the eight schools, reflecting that the 
experiments were performed independently and so each school’s observed value is 
an unbiased estimate of its own treatment effect. This model has eight parameters: 
an estimate for each school. 

• Complete pooling: A combined estimate averaging the data from all schools into a 
single number, reflecting that the eight schools were actually quite similar (as were 
the eight different treatments), and also reflecting that the variation among the 
eight estimates (the left column of numbers in Table 5.2) is no larger than would 
be expected by chance alone given the standard errors (the rightmost column in 
the table). This model has only one, shared, parameter. 

• Hierarchical model: A Bayesian meta-analysis, partially pooling the eight estimates 
toward a common mean. This model has eight parameters but they are constrained 
through their hierarchical distribution and are not estimated independently; thus 
the effective number of parameters should be some number less than 8. 


                                                                                No         Complete   Hierarchical
                                                                                pooling    pooling    model
                                                                                ($\tau = \infty$)  ($\tau = 0$)    ($\tau$ estimated)
AIC      $-2\,\mathrm{lpd} = -2 \log p(y \mid \hat\theta_{\mathrm{mle}})$                        54.6       59.4
         $k$                                                                    8.0        1.0
         $\mathrm{AIC} = -2\,\widehat{\mathrm{elpd}}_{\mathrm{AIC}}$                                 70.6       61.4
DIC      $-2\,\mathrm{lpd} = -2 \log p(y \mid \hat\theta_{\mathrm{Bayes}})$                      54.6       59.4       57.4
         $p_{\mathrm{DIC}}$                                                     8.0        1.0        2.8
         $\mathrm{DIC} = -2\,\widehat{\mathrm{elpd}}_{\mathrm{DIC}}$                                 70.6       61.4       63.0
WAIC     $-2\,\mathrm{lppd} = -2 \sum_i \log p_{\mathrm{post}}(y_i)$                    60.2       59.8       59.2
         $p_{\mathrm{WAIC}1}$                                                   2.5        0.6        1.0
         $p_{\mathrm{WAIC}2}$                                                   4.0        0.7        1.3
         $\mathrm{WAIC} = -2\,\widehat{\mathrm{elppd}}_{\mathrm{WAIC}2}$                             68.2       61.2       61.8
LOO-CV   $-2\,\mathrm{lppd}$                                                               59.8       59.2
         $p_{\mathrm{loo-cv}}$                                                             0.5        1.8
         $-2\,\mathrm{lppd}_{\mathrm{loo-cv}}$                                                  60.8       62.8


Table 7.1 Deviance (−2 times log predictive density) and corrections for parameter fitting using 
AIC, DIC, WAIC (using the correction $p_{\mathrm{WAIC}2}$), and leave-one-out cross-validation for each of 
three models fitted to the data in Table 5.2. Lower values of AIC/DIC/WAIC imply higher predictive 
accuracy. 

Blank cells in the table correspond to measures that are undefined: AIC is defined relative to the 
maximum likelihood estimate and so is inappropriate for the hierarchical model; cross-validation 
requires prediction for the held-out case, which is impossible under the no-pooling model. 

The no-pooling model has the best raw fit to data, but after correcting for fitted parameters, the 
complete-pooling model has lowest estimated expected predictive error under the different measures. 
In general, we would expect the hierarchical model to win, but in this particular case, setting $\tau = 0$ 
(that is, the complete-pooling model) happens to give the best average predictive performance. 


[2] P.S. Instead of this section, we recommend reading Vehtari, Gelman, and Gabry (2017). 


Table 7.1 illustrates the use of predictive log densities and information criteria to 
compare the three models—no pooling, complete pooling, and hierarchical—fitted to 
the SAT coaching data. We only have data at the group level, so we necessarily 
define our data points and cross-validation based on the 8 schools, not the individual 
students. 

We shall go down the rows of Table 7.1 to understand how the different information 
criteria work for each of these three models, then we discuss how these measures can 
be used to compare the models. 


AIC. The log predictive density is higher—that is, a better fit—for the no pooling 
model. This makes sense: with no pooling, the maximum likelihood estimate is right 
at the data, whereas with complete pooling there is only one number to fit all 8 schools. 
However, the ranking of the models changes after adjusting for the fitted parameters 
(8 for no pooling, 1 for complete pooling), and the expected log predictive density 
is estimated to be the best (that is, AIC is lowest) for complete pooling. The last 
column of the table is blank for AIC, as this procedure is defined based on maximum 
likelihood estimation which is meaningless for the hierarchical model. 

DIC. For the no-pooling and complete-pooling models with their flat priors, DIC 
gives results identical to AIC (except for possible simulation variability, which we have 
essentially eliminated here by using a large number of posterior simulation draws). 
DIC for the hierarchical model gives something in between: a direct fit to data (lpd) 
that is better than complete pooling but not as good as the (overfit) no pooling, and 
an effective number of parameters of 2.8, closer to 1 than to 8, which makes sense 
given that the estimated school effects are pooled almost all the way back to their 
common mean. Adding in the correction for fitting, complete pooling wins, which 
makes sense given that in this case the data are consistent with zero between-group 
variance. 


WAIC. This fully Bayesian measure gives results similar to DIC. The fit to observed 
data is slightly worse for each model (that is, the numbers for lppd are slightly more 
negative than the corresponding values for lpd, higher up in the table), accounting 
for the fact that the posterior predictive density has a wider distribution and thus 
has lower density values at the mode, compared to the predictive density conditional 
on the point estimate. However, the correction for effective number of parameters 
is lower (for no pooling and the hierarchical model, $p_{\mathrm{WAIC}}$ is about half of $p_{\mathrm{DIC}}$), 
consistent with the theoretical behavior of WAIC when there is only a single data 
point per parameter, while for complete pooling, $p_{\mathrm{WAIC}}$ is only a bit less than 1 
(roughly consistent with what we would expect from a sample size of 8). For all three 
models here, $p_{\mathrm{WAIC}}$ is much less than $p_{\mathrm{DIC}}$, with this difference arising from the fact 
that the lppd in WAIC is already accounting for much of the uncertainty arising from 
parameter estimation. 


Cross-validation. For this example it is impossible to cross-validate the no-pooling 
model as it would require the impossible task of obtaining a prediction from a held-out 
school given the other seven. This illustrates one main difference from information 
criteria, which assume new prediction for these same schools and thus also work in the 
no-pooling model. For complete pooling and for the hierarchical model, we can perform 
leave-one-out cross-validation directly. In this model the local prediction of 
cross-validation is based only on the information coming from the other schools, while the 
local prediction in WAIC is based on the local observation as well as the information 
coming from the other schools. In both cases the prediction is for unknown future 
data, but the amount of information used is different and thus predictive performance 
estimates differ more when the hierarchical prior becomes more vague (with the 
difference going to infinity as the hierarchical prior becomes uninformative, to yield the 
no-pooling model). 


Comparing the three models. For this particular dataset, complete pooling wins the 
expected out-of-sample prediction competition. Typically it is best to estimate the 
hierarchical variance but, in this case, $\tau = 0$ is the best fit to the data, and this 
is reflected in the center column of numbers in Table 7.1, where AIC, DIC, WAIC, and 
the LOO-CV deviance are all lower than for no pooling or the hierarchical model, thus 
corresponding to better predicted fit to new data. 
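
The comparison can be retraced directly from Table 7.1. The sketch below transcribes the uncorrected deviance and the effective number of parameters for each model, adds twice the correction, and reports which model has the lowest value under each criterion. The data structure is ours; for the hierarchical model's −2 lppd we take 59.2, the value that reproduces the tabulated WAIC of 61.8 and agrees with the −2 lppd entry in the LOO-CV rows.

```python
# (uncorrected deviance, effective number of parameters); None marks an undefined entry.
table_7_1 = {
    "AIC": {"no pooling": (54.6, 8.0), "complete pooling": (59.4, 1.0), "hierarchical": None},
    "DIC": {"no pooling": (54.6, 8.0), "complete pooling": (59.4, 1.0), "hierarchical": (57.4, 2.8)},
    "WAIC": {"no pooling": (60.2, 4.0), "complete pooling": (59.8, 0.7), "hierarchical": (59.2, 1.3)},
    "LOO-CV": {"no pooling": None, "complete pooling": (59.8, 0.5), "hierarchical": (59.2, 1.8)},
}


def corrected_deviances(entries):
    """Deviance plus twice the effective number of parameters, for defined entries only."""
    return {
        model: round(deviance + 2.0 * p_eff, 1)
        for model, pair in entries.items()
        if pair is not None
        for deviance, p_eff in [pair]
    }


results = {name: corrected_deviances(entries) for name, entries in table_7_1.items()}
winners = {name: min(values, key=values.get) for name, values in results.items()}

for name, values in results.items():
    print(f"{name:<7s} {values}  lowest: {winners[name]}")
```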

That said, we still prefer the hierarchical model here, because we do not believe that $\tau$ 
is truly zero. For example, the estimated effect in school A is 28 (with a standard error 
of 15) and the estimate in school C is −3 (with a standard error of 16). This difference 
is not statistically significant and, indeed, the data are consistent with there being 
zero variation of effects between schools; nonetheless we would feel uncomfortable, for 
example, stating that the posterior probability is 0.5 that the effect in school C is 
larger than the effect in school A, given data that show school A looking better. 
It might, however, be preferable to use a more informative prior distribution on $\tau$, 
given that very large values are both substantively implausible and also contribute to 
some of the predictive uncertainty under this model. 


In general, predictive accuracy measures are useful in parallel with posterior predictive 
checks to see if there are important patterns in the data that are not captured by each 
model. As with predictive checking, the log score can be computed in different ways for a 
hierarchical model depending on whether the parameters $\theta$ and replications $y^{\mathrm{rep}}$ correspond 
to estimates and replications of new data from the existing groups (as we have performed 
the calculations in the above example) or new groups (additional schools from the $\mathrm{N}(\mu, \tau^2)$ 
distribution in the above example). 


Evaluating predictive error comparisons 


When comparing models in their predictive accuracy, two issues arise, which might be called 
statistical and practical significance. Lack of statistical significance arises from uncertainty 
in the estimates of comparative out-of-sample prediction accuracy and is ultimately associated 
with variation in individual prediction errors which manifests itself in averages for 
any finite dataset. Some asymptotic theory suggests that the sampling variance of any 
estimate of average prediction error will be of order 1, so that, roughly speaking, differences 
of less than 1 could typically be attributed to chance. But this asymptotic result does not 
necessarily hold for nonnested models. A practical estimate of related sampling uncertainty 
can be obtained by analyzing the variation in the expected log predictive densities elppd, 
using parametric or nonparametric approaches. 
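
One simple nonparametric version of this idea, sketched below under our own assumptions, treats the n pointwise differences in log predictive density between two models as a sample and scales their standard deviation up to the sum. The function name and the toy numbers are hypothetical, and the calculation ignores any dependence between data points.

```python
import math
import statistics


def elpd_difference(pointwise_a, pointwise_b):
    """Difference in summed pointwise log predictive densities and a rough standard error."""
    if len(pointwise_a) != len(pointwise_b):
        raise ValueError("both models must be scored on the same data points")
    diffs = [a - b for a, b in zip(pointwise_a, pointwise_b)]
    n = len(diffs)
    total = sum(diffs)
    # Standard error of a sum of n terms, using the sample variance of the terms.
    se = math.sqrt(n * statistics.variance(diffs))
    return total, se


# Toy pointwise log predictive densities for two models on the same four data points.
model_a = [-1.0, -1.2, -0.8, -1.1]
model_b = [-1.1, -1.2, -1.0, -1.0]
total, se = elpd_difference(model_a, model_b)
print(f"difference = {total:.2f}, standard error = {se:.2f}")
```

Working with paired differences rather than the two totals separately accounts for the fact that both models are evaluated on the same data points.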


Sometimes it may be possible to use an application-specific scoring function that is 
so familiar for subject-matter experts that they can interpret the practical significance of 
differences. For example, epidemiologists are used to looking at differences in the area under 
the receiver operating characteristic curve (AUC) for classification and survival models. In 
settings without such conventional measures, it is not always clear how to interpret the 
magnitude of a difference in log predictive probability when comparing two models. Is a 
difference of 2 important? 10? 100? One way to understand such differences is to calibrate 
based on simpler models. For example, consider two models for a survey of n voters in an 
American election, with one model being completely empty (predicting p = 0.5 for each 
voter to support either party) and the other correctly assigning probabilities of 0.4 and 
0.6 (one way or another) to the voters. Setting aside uncertainties involved in fitting, the 
expected log predictive probability is log(0.5) = -0.693 per respondent for the first model 
and 0.6 log(0.6) + 0.4 log(0.4) = -0.673 per respondent for the second model. The expected 
improvement in log predictive probability from fitting the better model is then 0.02n. So, for 
n = 1000, this comes to an improvement of 20, but for n = 100 the predictive improvement 
is only 2. This would seem to accord with intuition: going from 50/50 to 60/40 is a clear 
win in a large sample, but in a smaller predictive dataset the modeling benefit would be 
hard to see amid the noise. 
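
The calibration above can be reproduced in a few lines. This is only a sketch of the arithmetic, setting aside fitting uncertainty as in the text; the function name is hypothetical.

```python
import math

def expected_log_score(p_true, p_pred):
    """Expected log predictive probability per respondent when the true
    support probability is p_true and the model predicts p_pred."""
    return p_true * math.log(p_pred) + (1 - p_true) * math.log(1 - p_pred)

empty = expected_log_score(0.6, 0.5)    # model that always predicts 0.5
better = expected_log_score(0.6, 0.6)   # model that predicts the true 0.6
gain_per_respondent = better - empty

for n in (100, 1000):
    print(f"n = {n:4d}: expected improvement = {n * gain_per_respondent:.1f}")
```

The same function can be used to calibrate other comparisons, for example 55/45 against 60/40, before interpreting an observed difference in log score.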


In our studies of public opinion and epidemiology, we have seen cases where a model that 
is larger and better (in the sense of giving more reasonable predictions) does not appear 
dominant in the predictive comparisons. This can happen because the improvements are 
small on an absolute scale (for example, changing the predicted average response among a 
particular category of the population from 55% Yes to 60% Yes) and concentrated in only 
a few subsets of the population (those for which there is enough data so that a more com- 
plicated model yields noticeably different predictions). Average out-of-sample prediction 
error can be a useful measure but it does not tell the whole story of model fit. 


Bias induced by model selection 


Cross-validation and information criteria make a correction for using the data twice (in 
constructing the posterior and in model assessment) and obtain asymptotically unbiased 
estimates of predictive performance for a given model. However, when these methods are 
used to select a model, the predictive performance estimate of the selected model 
is biased due to the selection process. 

If the number of compared models is small, the bias is small, but if the number of 
candidate models is very large (for example, if the number of models grows exponentially 
as the number of observations n grows, or the number of predictors p is much larger than 
log n in covariate selection) a model selection procedure can strongly overfit the data. It is 
possible to estimate the selection-induced bias and obtain unbiased estimates, for example 
by using another level of cross-validation. This does not, however, prevent the model 
selection procedure from possibly overfitting to the observations and consequently selecting 
models with suboptimal predictive performance. This is one reason we view cross-validation 
and information criteria as an approach for understanding fitted models rather than for 
choosing among them. 
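
The selection-induced bias can be illustrated with a small simulation. This is a sketch under strong simplifying assumptions that are ours, not the text's: every candidate model has the same true predictive score (zero), and each estimated score equals the truth plus independent normal noise. The winner's estimated score is then pure optimism, and it grows with the number of candidates.

```python
import random
import statistics

def selection_bias(num_models, num_repeats=2000, noise_sd=1.0, seed=1):
    """Average amount by which the best-looking model's estimated score
    exceeds its true score, when all models truly score 0."""
    rng = random.Random(seed)
    winners = []
    for _ in range(num_repeats):
        # Each candidate's estimate is its true score (0) plus estimation noise.
        estimates = [rng.gauss(0.0, noise_sd) for _ in range(num_models)]
        winners.append(max(estimates))
    return statistics.mean(winners)

for k in (1, 2, 10, 100):
    print(f"{k:4d} candidate models: selection bias about {selection_bias(k):.2f}")
```

With a single model there is no selection and the bias is near zero; with many candidates the selected model looks better than it is, even though no model is actually better than any other.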


Challenges 


The current state of the art of measurement of predictive model fit remains unsatisfying. 
Formulas such as AIC, DIC, and WAIC fail in various examples: AIC does not work in 
settings with strong prior information, DIC gives nonsensical results when the posterior 
distribution is not well summarized by its mean, and WAIC relies on a data partition that 
would cause difficulties with structured models such as for spatial or network data. Cross- 
validation is appealing but can be computationally expensive and also is not always well 
defined in dependent data settings. 

For these reasons, Bayesian statisticians do not always use predictive error comparisons 
in applied work, but we recognize that there are times when it can be useful to compare 
highly dissimilar models, and, for that purpose, predictive comparisons can make sense. In 
addition, measures of effective numbers of parameters are appealing tools for understanding 
statistical procedures, especially when considering models such as splines and Gaussian 
processes that have complicated dependence structures and thus no obvious formulas to 
summarize model complexity. 

Thus we see the value of the methods described here, for all their flaws. Right now 
our preferred choice is cross-validation. Bayesian cross-validation is asymptotically equal 
to WAIC. Pareto-smoothed importance sampling LOO-CV is computationally as efficient 
as WAIC, but more robust in the finite case with weak priors or influential observations. 


7.4 Model comparison using Bayes factors 


So far in this chapter we have discussed model evaluation and comparison based on expected 
predictive accuracy. Another way to compare models is through a Bayesian analysis in which 
each model is given a prior probability which, when multiplied by the marginal likelihood 
(the probability of the data given the model) yields a quantity that is proportional to the 
posterior probability of the model. This fully Bayesian approach has some appeal but 
we generally do not recommend it because, in practice, the marginal likelihood is highly 
sensitive to aspects of the model that are typically assigned arbitrarily and are untestable 
from data. Here we present the general idea and illustrate with two examples, one where it 
makes sense to assign prior and posterior probabilities to discrete models, and one example 
where it does not. 

In a problem in which a discrete set of competing models is proposed, the term Bayes 
factor is sometimes used for the ratio of the marginal probability density under one model 
to the marginal density under a second model. If we label two competing models as $H_1$ and 
$H_2$, then the ratio of their posterior probabilities is 

$$\frac{p(H_2|y)}{p(H_1|y)} = \frac{p(H_2)}{p(H_1)} \times \text{Bayes factor}(H_2; H_1),$$

where 

$$\text{Bayes factor}(H_2; H_1) = \frac{p(y|H_2)}{p(y|H_1)} = \frac{\int p(\theta_2|H_2)\,p(y|\theta_2, H_2)\,d\theta_2}{\int p(\theta_1|H_1)\,p(y|\theta_1, H_1)\,d\theta_1}. \qquad (7.16)$$

In many cases, the competing models have a common set of parameters, but this is not 
necessary; hence the notation $\theta_i$ for the parameters in model $H_i$. As expression (7.16) 
makes clear, the Bayes factor is only defined when the marginal density of $y$ under each 
model is proper. 

The goal when using Bayes factors is to choose a single model $H_i$ or average over a 
discrete set using their posterior probabilities, $p(H_i|y)$. As we show in examples in this 
book, we generally prefer to replace a discrete set of models with an expanded continuous 
family. The bibliographic note at the end of the chapter provides pointers to more extensive 
treatments of Bayes factors. 

Bayes factors can work well when the underlying model is truly discrete and for which 
it makes sense to consider one or the other model as being a good description of the data. 
We illustrate with an example from genetics. 


Example. A discrete example in which Bayes factors are helpful 

The Bayesian inference for the genetics example in Section 1.4 can be fruitfully 
expressed using Bayes factors, with the two competing ‘models’ being $H_1$: the woman 
is affected, and $H_2$: the woman is unaffected, that is, $\theta = 1$ and $\theta = 0$ in the notation 
of Section 1.4. The prior odds are $p(H_2)/p(H_1) = 1$, and the Bayes factor of the data 
that the woman has two unaffected sons is $p(y|H_2)/p(y|H_1) = 1.0/0.25$. The posterior 
odds are thus $p(H_2|y)/p(H_1|y) = 4$. Computation by multiplying odds ratios makes 
the accumulation of evidence clear. 

This example has two features that allow Bayes factors to be helpful. First, each of the 
discrete alternatives makes scientific sense, and there are no obvious scientific models 
in between. Second, the marginal distribution of the data under each model, $p(y|H_i)$, 
is proper. 
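
The odds calculation in this example can be written out directly. The sketch below uses exact fractions; the function name is hypothetical, and the third son is a hypothetical extra observation added only to show how evidence accumulates by repeated multiplication.

```python
from fractions import Fraction

def posterior_odds(prior_odds, likelihood_h2, likelihood_h1):
    """Posterior odds of H2 versus H1: prior odds times the Bayes factor."""
    bayes_factor = Fraction(likelihood_h2) / Fraction(likelihood_h1)
    return prior_odds * bayes_factor

prior = Fraction(1, 1)  # p(H2)/p(H1): unaffected versus affected
# Two unaffected sons: probability 1 if unaffected, (1/2)^2 if affected.
odds_after_two_sons = posterior_odds(prior, 1, Fraction(1, 4))
# A hypothetical third unaffected son multiplies the odds by a further 1/(1/2).
odds_after_three_sons = posterior_odds(odds_after_two_sons, 1, Fraction(1, 2))

prob_h1 = 1 / (1 + odds_after_two_sons)  # posterior probability of H1
print(odds_after_two_sons, odds_after_three_sons, prob_h1)
```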


Bayes factors do not work so well for models that are inherently continuous. For example, 
we do not like models that assign a positive probability to the event $\theta = 0$, if $\theta$ is some 
continuous parameter such as a treatment effect. Similarly, if a researcher expresses interest 
in comparing or choosing among various discrete regression models (the problem of variable 
selection), we would prefer to include all the candidate variables, using a prior distribution 
to partially pool the coefficients to zero if this is desired. To illustrate the problems with 
Bayes factors for continuous models, we use the example of the no-pooling and complete- 
pooling models for the 8 schools problem. 


Example. A continuous example where Bayes factors are a distraction 
We now consider a case in which discrete model comparisons and Bayes factors distract 
from scientific inference. Suppose we had analyzed the data in Section 5.5 from the 8 
schools using Bayes factors for the discrete collection of previously proposed standard 
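models, no pooling ($H_1$) and complete pooling ($H_2$): 

$$H_1: \quad p(y|\theta_1,\ldots,\theta_J) = \prod_{j=1}^{J} \mathrm{N}(y_j|\theta_j, \sigma_j^2), \quad p(\theta_1,\ldots,\theta_J) \propto 1$$

$$H_2: \quad p(y|\theta_1,\ldots,\theta_J) = \prod_{j=1}^{J} \mathrm{N}(y_j|\theta_j, \sigma_j^2), \quad \theta_1 = \cdots = \theta_J = \theta, \quad p(\theta) \propto 1.$$

(Recall that the standard deviations $\sigma_j$ are assumed known in this example.) 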

If we use Bayes factors to choose or average among these models, we are immediately 
confronted with the fact that the Bayes factor—the ratio p(y|H1)/p(y|H2)—is not 
defined; because the prior distributions are improper, the ratio of density functions 
is 0/0. Consequently, if we wish to continue with the approach of assigning posterior 
probabilities to these two discrete models, we must consider (1) proper prior distri- 
butions, or (2) improper prior distributions that are carefully constructed as limits of 
proper distributions. In either case, we shall see that the results are unsatisfactory. 

More explicitly, suppose we replace the flat prior distributions in $H_1$ and $H_2$ by 
independent normal prior distributions, $\mathrm{N}(0, A^2)$, for some large $A$. The resulting posterior 
distribution for the effect in school $j$ is 

$$p(\theta_j|y) = (1 - \lambda)\,p(\theta_j|y, H_1) + \lambda\,p(\theta_j|y, H_2),$$

where the two conditional posterior distributions are normal centered near $y_j$ and $\bar{y}$, 
respectively, and $\lambda$ is proportional to the prior odds times the Bayes factor, which is 
a function of the data and $A$ (see Exercise 7.4). The Bayes factor for this problem is 
highly sensitive to the prior variance, $A^2$; as $A$ increases (with fixed data and fixed prior 
odds, p(H2)/p(H1)) the posterior distribution becomes more and more concentrated 
on H2, the complete pooling model. Therefore, the Bayes factor cannot be reasonably 
applied to the original models with noninformative prior densities, even if they are 
carefully defined as limits of proper prior distributions. 
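
The sensitivity to $A$ can be checked numerically. The sketch below is our own illustration rather than a calculation from the text: it uses the eight schools estimates and standard errors from Section 5.5, computes the two marginal likelihoods in closed form under $\mathrm{N}(0, A^2)$ priors, and prints the log Bayes factor for several values of $A$. The function names are hypothetical.

```python
import math

# Eight schools estimates and standard errors (Section 5.5 of the book).
y = [28, 8, -3, 7, -1, 1, 18, 12]
sigma = [15, 10, 16, 11, 9, 11, 10, 18]

def log_marginal_no_pooling(y, sigma, A):
    """log p(y|H1): independent theta_j ~ N(0, A^2), so y_j ~ N(0, sigma_j^2 + A^2)."""
    total = 0.0
    for yj, sj in zip(y, sigma):
        var = sj**2 + A**2
        total += -0.5 * math.log(2 * math.pi * var) - 0.5 * yj**2 / var
    return total

def log_marginal_complete_pooling(y, sigma, A):
    """log p(y|H2): common theta ~ N(0, A^2), so y is multivariate normal with
    covariance diag(sigma^2) + A^2 * (matrix of ones). The determinant and the
    quadratic form use the rank-one (Sherman-Morrison) identities."""
    J = len(y)
    s = sum(1 / sj**2 for sj in sigma)                   # sum of precisions
    b = sum(yj / sj**2 for yj, sj in zip(y, sigma))      # precision-weighted sum
    q = sum(yj**2 / sj**2 for yj, sj in zip(y, sigma))
    log_det = sum(math.log(sj**2) for sj in sigma) + math.log(1 + A**2 * s)
    quad = q - A**2 * b**2 / (1 + A**2 * s)
    return -0.5 * J * math.log(2 * math.pi) - 0.5 * log_det - 0.5 * quad

def log_bayes_factor(A):
    """log of Bayes factor(H2; H1) for prior scale A."""
    return (log_marginal_complete_pooling(y, sigma, A)
            - log_marginal_no_pooling(y, sigma, A))

for A in (10, 100, 1000, 10000):
    print(f"A = {A:6d}: log Bayes factor(H2; H1) = {log_bayes_factor(A):6.1f}")
```

For large $A$ the log Bayes factor grows roughly like $(J - 1)\log A$, since the no-pooling model pays a penalty for each of its $J$ diffuse parameters while complete pooling pays for only one. The data have little to do with it.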

Yet another problem with the Bayes factor for this example is revealed by considering 
its behavior as the number of schools being fitted to the model increases. The posterior 
distribution for $\theta_j$ under the mixture of $H_1$ and $H_2$ turns out to be sensitive to the 
dimensionality of the problem, as much different inferences would be obtained if, for 
example, the model were applied to similar data on 80 schools (see Exercise 7.4). It 
makes no scientific sense for the posterior distribution to be highly sensitive to aspects 
of the prior distributions and problem structure that are scientifically incidental. 
Thus, if we were to use a Bayes factor for this problem, we would find a problem in the 
model-checking stage (a discrepancy between posterior distribution and substantive 
knowledge), and we would be moved toward setting up a smoother, continuous family 
of models to bridge the gap between the two extremes. A reasonable continuous family 
of models is $y_j \sim \mathrm{N}(\theta_j, \sigma_j^2)$, $\theta_j \sim \mathrm{N}(\mu, \tau^2)$, with a flat prior distribution on $\mu$, and 
$\tau$ in the range $[0, \infty)$; this is the model we used in Section 5.5. Once the continuous 
expanded model is fitted, there is no reason to assign discrete positive probabilities to 
the values $\tau = 0$ and $\tau = \infty$, considering that neither makes scientific sense. 


7.5 Continuous model expansion 


Sensitivity analysis 


In general, the posterior distribution of the model parameters can either overestimate or 
underestimate different aspects of ‘true’ posterior uncertainty. The posterior distribution 
typically overestimates uncertainty in the sense that one does not, in general, include all of 
one’s substantive knowledge in the model; hence the utility of checking the model against 
one’s substantive knowledge. On the other hand, the posterior distribution underestimates 
uncertainty in two senses: first, the assumed model is almost certainly wrong—hence the 
need for posterior model checking against the observed data—and second, other reasonable 
models could have fit the observed data equally well, hence the need for sensitivity analysis. 
We have already addressed model checking. In this section, we consider the uncertainty 
in posterior inferences due to the existence of reasonable alternative models and discuss 
how to expand the model to account for this uncertainty. Alternative models can differ 
in the specification of the prior distribution, in the specification of the likelihood, or both. 
Model checking and sensitivity analysis go together: when conducting sensitivity analysis, 
it is only necessary to consider models that fit substantive knowledge and observed data in 
relevant ways. 

The basic method of sensitivity analysis is to fit several probability models to the same 
problem. It is often possible to avoid surprises in sensitivity analyses by replacing improper 
prior distributions with proper distributions that represent substantive prior knowledge. In 
addition, different questions are differently affected by model changes. Naturally, poste- 
rior inferences concerning medians of posterior distributions are generally less sensitive to 
changes in the model than inferences about means or extreme quantiles. Similarly, predic- 
tive inferences about quantities that are most like the observed data are most reliable; for 
example, in a regression model, interpolation is typically less sensitive to linearity assump- 
tions than extrapolation. It is sometimes possible to perform a sensitivity analysis by using 
‘robust’ models, which ensure that unusual observations (or larger units of analysis in a 
hierarchical model) do not exert an undue influence on inferences. The typical example is 
the use of the t distribution in place of the normal (either for the sampling or the popula- 
tion distribution). Such models can be useful but require more computational effort. We 
consider robust models in Chapter 17. 


Adding parameters to a model 


There are several possible reasons to expand a model: 


1. If the model does not fit the data or prior knowledge in some important way, it should 
be altered in some way, possibly by adding enough new parameters to allow a better fit. 


2. If a modeling assumption is questionable or has no real justification, one can broaden 
the class of models (for example, replacing a normal by a t, as we do in Section 17.4 for 
the SAT coaching example). 


3. If two different models, $p_1(y, \theta)$ and $p_2(y, \theta)$, are under consideration, they can be 
combined into a larger model using a continuous parameterization that includes the original 
models as special cases. For example, the hierarchical model for SAT coaching in Chapter 
5 is a continuous generalization of the complete-pooling ($\tau = 0$) and no-pooling ($\tau = \infty$) 
models. 


4. A model can be expanded to include new data; for example, an experiment previously 
analyzed on its own can be inserted into a hierarchical population model. Another 
common example is expanding a regression model of y|x to a multivariate model of (x, y) 
in order to model missing data in x (see Chapter 18). 


All these applications of model expansion have the same mathematical structure: the old 
model, $p(y, \theta)$, is embedded in or replaced by a new model, $p(y, \theta, \phi)$ or, more generally, 
$p(y, y^*, \theta, \phi)$, where $y^*$ represents the added data. 

The joint posterior distribution of the new parameters, $\phi$, and the parameters $\theta$ of the 
old model is, 

$$p(\theta, \phi|y, y^*) \propto p(\phi)\,p(\theta|\phi)\,p(y, y^*|\theta, \phi).$$

The conditional prior distribution, $p(\theta|\phi)$, and the likelihood, $p(y, y^*|\theta, \phi)$, are determined 
by the expanded family. The marginal distribution of $\phi$ is obtained by averaging over $\theta$: 

$$p(\phi|y, y^*) \propto p(\phi) \int p(\theta|\phi)\,p(y, y^*|\theta, \phi)\,d\theta. \qquad (7.17)$$

In any expansion of a Bayesian model, one must specify a set of prior distributions, 
$p(\theta|\phi)$, to replace the old $p(\theta)$, and also a hyperprior distribution $p(\phi)$ on the hyperparameters. 
Both tasks typically require thought, especially with noninformative prior distributions 
(see Exercises 6.7 and 6.5). For example, Section 14.7 discusses a model for unequal variances 
that includes unweighted and weighted linear regression as extreme cases. In Section 
17.4, we illustrate the task of expanding the normal model for the SAT coaching example of 
Section 5.5 to a $t$ model by including the degrees of freedom of the $t$ distribution as an additional 
hyperparameter. Another detailed example of model expansion appears in Section 
22.2, for a hierarchical mixture model applied to data from an experiment in psychology. 


Accounting for model choice in data analysis 


We typically construct the final form of a model only after extensive data analysis, which 
leads to concerns that are related to the classical problems of multiple comparisons and 
estimation of prediction error. As discussed in Section 4.5, a Bayesian treatment of multiple 
comparisons uses hierarchical modeling, simultaneously estimating the joint distribution of 
all possible comparisons and shrinking these as appropriate (for example, in the analysis of 
the eight schools, the $\theta_j$’s are all shrunk toward $\mu$, so the differences $\theta_j - \theta_k$ are automatically 
shrunk toward 0). Nonetheless, some potential problems arise, such as the possibility of 
performing many analyses on a single dataset in order to find the strongest conclusion. This 
is a danger with all applied statistical methods and is only partly alleviated by the Bayesian 
attempt to include all sources of uncertainty in a model. 


Selection of predictors and combining information 


In regression problems there are generally many different reasonable-seeming ways to set up 
a model, and these different models can give dramatically different answers (as we illustrate 
in Section 9.2 in an analysis of the effects of incentives on survey response). Putting together 
existing information in the form of predictors is nearly always an issue in observational 
studies (see Section 8.6), and can be seen as a model specification issue. Even when only a 
few predictors are available, we can choose among possible transformations and interactions. 

As we shall discuss in Sections 14.6 and 15.6, we prefer including as many predictors 
as possible in a regression and then scaling and batching them into an analysis-of-variance 
structure, so that they are all considered to some extent rather than being discretely ‘in’ 
or ‘out.’ Even so, choices must be made in selecting the variables to be included in the 
hierarchical model itself. Bayesian methods for discrete model averaging may be helpful 
here, although we have not used this approach in our own research. 

A related and more fundamental issue arises when setting up regression models for causal 
inference in observational studies. Here, the relations among the variables in the substantive 
context are relevant, as in principal stratification methods (see Section 8.6), where, after 
the model is constructed, additional analysis is required to compute causal estimands of 
interest, which are not in general the same as the regression coefficients. 


Alternative model formulations 


We often find that adding a parameter to a model makes it much more flexible. For example, 
in a normal model, we prefer to estimate the variance parameter rather than set it to a 
prechosen value. At the next stage, the $t$ model is more flexible than the normal (see Chapter 
17), and this has been shown to make a practical difference in many applications. But why 
stop there? There is always a balance between accuracy and convenience. As discussed in 
Chapter 6, predictive model checks can reveal serious model misfit, but we do not yet have 
good general principles to justify our basic model choices. As computation of hierarchical 
models becomes more routine, we may begin to use more elaborate models as defaults. 


Practical advice for model checking and expansion 


It is difficult to give appropriate general advice for model choice; as with model building, 
scientific judgment is required, and approaches must vary with context. 

Our recommended approach, for both model checking and sensitivity analysis, is to 
examine posterior distributions of substantively important parameters and predicted quan- 
tities. Then we compare posterior distributions and posterior predictions with substantive 
knowledge, including the observed data, and note where the predictions fail. Discrepancies 
should be used to suggest possible expansions of the model, perhaps as simple as putting 
real prior information into the prior distribution or adding a parameter such as a nonlinear 
term in a regression, or perhaps requiring some substantive rethinking, as for the poor pre- 
diction of the southern states in the presidential election model as displayed in Figure 6.1 
on page 143. 

Sometimes a model has stronger assumptions than are immediately apparent. For ex- 
ample, a regression with many predictors and a flat prior distribution on the coefficients will 
tend to overestimate the variation among the coefficients, just as the independent estimates 
for the eight schools were more spread than appropriate. If we find that the model does not 
fit for its intended purposes, we are obliged to search for a new model that fits; an analysis 
is rarely, if ever, complete with simply a rejection of some model. 

If a sensitivity analysis reveals problems, the basic solution is to include the other 
plausible models in the prior specification, thereby forming a posterior inference that re- 
flects uncertainty in the model specification, or simply to report sensitivity to assumptions 
untestable by the data at hand. And one must sometimes conclude that, for practical pur- 
poses, available data cannot effectively answer some questions. In other cases, it is possible 
to add information to constrain the model enough to allow useful inferences; Section 7.6 
presents an example in the context of a simple random sample from a nonnormal population, 
in which the quantity of interest is the population total. 


7.6 Implicit assumptions and model expansion: an example 


Despite our best efforts to include information, all models are approximate. Hence, checking 
the fit of a model to data and prior assumptions is always important. For the purpose 
of model evaluation, we can think of the inferential step of Bayesian data analysis as a 
sophisticated way to explore all the implications of a proposed model, in such a way that 
these implications can be compared with observed data and other knowledge not included 
in the model. For example, Section 6.4 illustrates graphical predictive checks for models 
fitted to data for two different problems in psychological research. In each case, the fitted 
model captures a general pattern of the data but misses some key features. In the second 
example, finding the model failure leads to a model improvement—a mixture distribution 
for the patient and symptom parameters—that better fits the data, as seen in Figure 6.10. 

Posterior inferences can often be summarized graphically. For simple problems or one or 
two-dimensional summaries, we can plot a histogram or scatterplot of posterior simulations, 
as in Figures 3.2, 3.3, and 5.8. For larger problems, summary graphs such as Figures 5.4— 
5.7 are useful. Plots of several independently derived inferences are useful in summarizing 
results so far and suggesting future model improvements. We illustrate in Figure 14.2 with 
a series of estimates of the advantage of incumbency in congressional elections. 

When checking a model, one must keep in mind the purposes for which it will be used. 
For example, the normal model for football scores in Section 1.6 accurately predicts the 
probability of a win, but gives poor predictions for the probability that a game is exactly 
tied (see Figure 1.1). 


           Population     Sample 1     Sample 2 
            (N = 804)    (n = 100)    (n = 100) 

total      13,776,663    1,966,745    3,850,502 
mean           17,135       19,667       38,505 
sd            139,147      142,218      228,625 
lowest             19          164          162 
5%                336          308          315 
25%               800          891          863 
median          1,668        2,081        1,740 
75%             5,050        6,049        5,239 
95%            30,295       25,130       41,718 
highest     2,627,319    1,424,815    1,809,578 


Table 7.2 Summary statistics for populations of municipalities in New York State in 1960 (New 
York City was represented by its five boroughs); all 804 municipalities and two independent simple 
random samples of 100. From Rubin (1988a). 

We should also know the limitations of automatic Bayesian inference. Even a model 
that fits observed data well can yield poor inferences about some quantities of interest. It is 
surprising and instructive to see the pitfalls that can arise when models are not subjected 
to model checks. 


Example. Estimating a population total under simple random sampling 
using transformed normal models 

We consider the problem of estimating the total population of the N = 804 munic- 
ipalities in New York State in 1960 from a simple random sample of n = 100—an 
artificial example, but one that illustrates the role of model checking in avoiding 
seriously wrong inferences. Table 7.2 summarizes the population of this ‘survey’ along 
with two simple random samples (which were the first and only ones chosen). With 
knowledge of the population, neither sample appears particularly atypical; sample 
1 is representative of the population according to the summary statistics provided, 
whereas sample 2 has a few too many large values. Consequently, it might at first 
glance seem straightforward to estimate the population total, perhaps overestimating 
the total from the second sample. 


Sample 1: initial analysis. We begin by trying to estimate the population total from 
sample 1 assuming that the $N$ values in the population were drawn from a $\mathrm{N}(\mu, \sigma^2)$ 
superpopulation, with a uniform prior density on $(\mu, \log\sigma)$. To use notation introduced 
more formally in Chapter 8, we wish to estimate the finite-population quantity, 

$$y_{\mathrm{total}} = N\bar{y} = n\,\bar{y}_{\mathrm{obs}} + (N - n)\,\bar{y}_{\mathrm{mis}}, \qquad (7.18)$$

where $\bar{y}_{\mathrm{obs}}$ is the average for the 100 observed municipalities, and $\bar{y}_{\mathrm{mis}}$ is the average 
for the 704 others. As we discuss in Section 8.3, under this model, the posterior 
distribution of $\bar{y}$ is $t_{n-1}(\bar{y}_{\mathrm{obs}}, (\frac{1}{n} - \frac{1}{N}) s_{\mathrm{obs}}^2)$. Using the data from the second column 
of Table 7.2 and the tabulated $t$ distribution, we obtain the following 95% posterior 
interval for $y_{\mathrm{total}}$: $[-5.4 \times 10^6, 37.0 \times 10^6]$. The practical person examining this 95% 
interval might find the upper limit useful and simply replace the lower limit by the 
total in the sample, since the total in the population can be no less. This procedure 
gives a 95% interval estimate of $[2.0 \times 10^6, 37.0 \times 10^6]$. 
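
The normal-theory interval can be reproduced from the summary statistics alone. In this sketch the sample 1 mean, standard deviation, and total are taken from Table 7.2; the $t$ quantile 1.984 (0.975 quantile with 99 degrees of freedom) is a tabulated value that we supply, and the variable names are hypothetical.

```python
import math

N, n = 804, 100
ybar_obs = 19_667        # sample 1 mean, from Table 7.2
s_obs = 142_218          # sample 1 standard deviation, from Table 7.2
sample_total = 1_966_745
T_975_DF99 = 1.984       # tabulated 0.975 quantile of t with 99 degrees of freedom

# Posterior scale of ybar: sqrt((1/n - 1/N) * s_obs^2).
scale = math.sqrt((1 / n - 1 / N) * s_obs**2)
lower = N * (ybar_obs - T_975_DF99 * scale)
upper = N * (ybar_obs + T_975_DF99 * scale)

# The population total cannot be smaller than the observed sample total.
lower_truncated = max(lower, sample_total)

print(f"95% interval: [{lower / 1e6:.1f}, {upper / 1e6:.1f}] million")
print(f"truncated:    [{lower_truncated / 1e6:.1f}, {upper / 1e6:.1f}] million")
```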

Surely, modestly intelligent use of statistical models should produce a better answer 
because, as we can see in Table 7.2, both the population and sample 1 are far from 
normal, and the standard interval is most appropriate with normal populations. 
Moreover, all values in the population are known ahead of time to be positive. 

We repeat the above analysis under the assumption that the $N = 804$ values in the 
complete data follow a lognormal distribution: $\log y_i \sim \mathrm{N}(\mu, \sigma^2)$, with a uniform 
prior distribution on $(\mu, \log\sigma)$. Posterior inference for $y_{\mathrm{total}}$ is performed in the usual 
manner: drawing $(\mu, \sigma)$ from their posterior (normal-inverse-$\chi^2$) distribution, then 
drawing $y_{\mathrm{mis}}|\mu, \sigma$ from the predictive distribution, and finally calculating $y_{\mathrm{total}}$ from 
(7.18). Based on 100 simulation draws, the 95% interval for $y_{\mathrm{total}}$ is $[5.4 \times 10^6, 9.9 \times 10^6]$. 
This interval is narrower than the original interval and at first glance looks like an 
improvement. 
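
The three simulation steps can be sketched as follows. The raw municipality data are not reproduced here, so the code uses a hypothetical lognormal sample as a stand-in; the function names are also hypothetical, and the numbers it prints are not those reported above.

```python
import math
import random
import statistics

def draw_mu_sigma(log_y, rng):
    """One draw of (mu, sigma) from the normal-inverse-chi^2 posterior under a
    uniform prior on (mu, log sigma)."""
    n = len(log_y)
    mean, var = statistics.mean(log_y), statistics.variance(log_y)
    chi2 = rng.gammavariate((n - 1) / 2, 2.0)   # chi^2 with n - 1 degrees of freedom
    sigma2 = (n - 1) * var / chi2               # scaled inverse-chi^2 draw
    mu = rng.gauss(mean, math.sqrt(sigma2 / n))
    return mu, math.sqrt(sigma2)

def simulate_totals(y_obs, N, num_draws, rng):
    """Posterior draws of y_total = sum(y_obs) + sum(y_mis) under the lognormal model."""
    log_y = [math.log(v) for v in y_obs]
    totals = []
    for _ in range(num_draws):
        mu, sigma = draw_mu_sigma(log_y, rng)
        y_mis = [math.exp(rng.gauss(mu, sigma)) for _ in range(N - len(y_obs))]
        totals.append(sum(y_obs) + sum(y_mis))
    return sorted(totals)

rng = random.Random(0)
# Hypothetical stand-in for the 100 sampled municipality sizes.
y_obs = [math.exp(rng.gauss(7.5, 1.5)) for _ in range(100)]
totals = simulate_totals(y_obs, N=804, num_draws=100, rng=rng)
# Rough 95% interval from 100 sorted draws: 3rd smallest to 3rd largest.
print(f"95% interval: [{totals[2]:.3g}, {totals[97]:.3g}]")
```

Because the observed values enter the total directly and the simulated values are positive, every simulated total exceeds the sample total, so no truncation of the lower limit is needed.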


Sample 1: checking the lognormal model. One of our major principles is to check 
the fit of our models. Because we are interested in a population total, $y_{\mathrm{total}}$, we 
apply a posterior predictive check using, as a test statistic, the total in the sample, 
$T(y_{\mathrm{obs}}) = \sum_{i=1}^{n} y_{\mathrm{obs}\,i}$. Using our $S = 100$ sample draws of $(\mu, \sigma^2)$ from the posterior 
distribution under the lognormal model, we obtain posterior predictive simulations of 
$S$ independent replicated datasets, $y_{\mathrm{obs}}^{\mathrm{rep}}$, and compute $T(y_{\mathrm{obs}}^{\mathrm{rep}}) = \sum_{i=1}^{n} y_{\mathrm{obs}\,i}^{\mathrm{rep}}$ for each. 
The result is that, for this predictive quantity, the lognormal model is unacceptable: 
all of the S = 100 simulated values are lower than the actual total in the sample, 
1,966,745. 
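
A sketch of this posterior predictive check, again on hypothetical data since the raw sample is not reproduced here: the stand-in sample has a lognormal body plus one very large unit, which mimics the kind of misfit described above. The function name is hypothetical.

```python
import math
import random
import statistics

def predictive_check_total(y_obs, num_draws, rng):
    """Fraction of replicated sample totals that exceed the observed total,
    under the lognormal model with a uniform prior on (mu, log sigma)."""
    n = len(y_obs)
    log_y = [math.log(v) for v in y_obs]
    mean, var = statistics.mean(log_y), statistics.variance(log_y)
    observed_total = sum(y_obs)
    exceed = 0
    for _ in range(num_draws):
        # Draw (mu, sigma^2) from the normal-inverse-chi^2 posterior.
        sigma2 = (n - 1) * var / rng.gammavariate((n - 1) / 2, 2.0)
        mu = rng.gauss(mean, math.sqrt(sigma2 / n))
        # Replicate a dataset of the same size and compare its total.
        y_rep = [math.exp(rng.gauss(mu, math.sqrt(sigma2))) for _ in range(n)]
        exceed += sum(y_rep) > observed_total
    return exceed / num_draws

rng = random.Random(0)
# Hypothetical data: lognormal body plus one very large unit that the model fits poorly.
y_obs = [math.exp(rng.gauss(7.5, 1.0)) for _ in range(99)] + [2_000_000.0]
p_value = predictive_check_total(y_obs, num_draws=100, rng=rng)
print(f"fraction of replicated totals above observed: {p_value:.2f}")
```

A fraction near 0 or 1 signals that the model cannot reproduce the sample total, which is the quantity most relevant to the inference of interest.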


Sample 1: extended analysis. A natural generalization beyond the lognormal model 
for municipality sizes is the power-transformed normal family, which adds an 
additional parameter, $\phi$, to the model; see (7.19) on page 194 for details. The values $\phi = 1$ 
and 0 correspond to the untransformed normal and lognormal models, respectively, 
and other values correspond to other transformations. 

To fit a transformed normal family to data $y_{\mathrm{obs}}$, the easiest computational approach is 
to fit the normal model to transformed data at several values of $\phi$ and then compute 
the marginal posterior density of $\phi$. Using the data from sample 1, the marginal 
posterior density of $\phi$ is strongly peaked around the value $-1/8$ (assuming a uniform 
prior distribution for $\phi$, which is reasonable given the relatively informative likelihood). 
Based on 100 simulated values under the extended model, the 95% interval for $y_{\mathrm{total}}$ 
is $[5.8 \times 10^6, 31.8 \times 10^6]$. With respect to the posterior predictive check, 15 out of 100 
simulated replications of the sample total are larger than the actual sample total; the 
model fits adequately in this sense. 
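
The grid computation for $\phi$ can be sketched as follows. Several choices here are assumptions of ours: the Box-Cox form of the power transformation, a uniform prior on $(\mu, \log\sigma, \phi)$, and hypothetical lognormal data in place of the actual sample. With that prior, integrating out $(\mu, \sigma)$ leaves a marginal density proportional to the Jacobian $\prod_i y_i^{\phi - 1}$ times $(s_\phi^2)^{-(n-1)/2}$, where $s_\phi^2$ is the sample variance of the transformed data.

```python
import math
import random
import statistics

def power_transform(y, phi):
    """Box-Cox form of the power transformation; phi = 0 gives log(y)."""
    return math.log(y) if phi == 0 else (y**phi - 1) / phi

def log_marginal_phi(y_obs, phi):
    """Unnormalized log marginal posterior of phi, assuming a uniform prior on
    (mu, log sigma, phi): (mu, sigma) are integrated out analytically and the
    Jacobian of the transformation is included."""
    n = len(y_obs)
    z = [power_transform(v, phi) for v in y_obs]
    log_jacobian = (phi - 1) * sum(math.log(v) for v in y_obs)
    return log_jacobian - 0.5 * (n - 1) * math.log(statistics.variance(z))

def grid_posterior(y_obs, grid):
    """Normalize the marginal posterior of phi over a discrete grid."""
    logp = [log_marginal_phi(y_obs, phi) for phi in grid]
    top = max(logp)
    weights = [math.exp(lp - top) for lp in logp]   # subtract max for stability
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
# Hypothetical lognormal data, so the posterior should concentrate near phi = 0.
y_obs = [math.exp(rng.gauss(7.5, 1.5)) for _ in range(200)]
grid = [k / 20 for k in range(-20, 21)]             # phi from -1 to 1 in steps of 0.05
post = grid_posterior(y_obs, grid)
mode = grid[post.index(max(post))]
print(f"posterior mode of phi on the grid: {mode:+.2f}")
```

Draws of $\phi$ from this grid, followed by draws of $(\mu, \sigma)$ and the unobserved values given $\phi$, would give simulations of the population total under the extended model.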

Perhaps we have learned how to apply Bayesian methods successfully to estimate a 
population total with this sort of data: use a power-transformed family and summa- 
rize inference by simulation draws. But we did not conduct a rigorous test of this 
conjecture. We started with the log transformation and obtained an inference that 
initially looked respectable, but we saw that the posterior predictive check indicated 
a lack of fit in the model with respect to predicting the sample total. We then en- 
larged the family of transformations and performed inference under the larger model 
(or, equivalently in this case, found the best-fitting transformation, since the trans- 
formation power was so precisely estimated by the data). The extended procedure 
seemed to work in the sense that the 95% interval was plausible; moreover, the poste- 
rior predictive check on the sample total was acceptable. To check on this extended 
procedure, we try it on the second random sample of 100. 


Sample 2. The standard normal-based inference for the population total from the 
second sample yields a 95% interval of $[-3.4 \times 10^6, 65.3 \times 10^6]$. Substituting the 
sample total for the lower limit gives the wide interval of $[3.9 \times 10^6, 65.3 \times 10^6]$. 

Following the steps used on sample 1, modeling the sample 2 data as lognormal leads 
to a 95% interval for $y_{\mathrm{total}}$ of $[8.2 \times 10^6, 19.6 \times 10^6]$. The lognormal inference is tight. 
However, in the posterior predictive check for sample 2 with the lognormal model, 
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none of 100 simulations of the sample total was as large as the observed sample total, 
and so once again we find this model unsuited for estimation of the population total. 
Based upon our experience with sample 1, and the posterior predictive checks under 
the lognormal models for both samples, we should not trust the lognormal interval 
and instead should consider the general power family, which includes the lognormal 
as a special case. For sample 2, the marginal posterior distribution for the power 
parameter ¢ is strongly peaked at —i. The posterior predictive check generated 48 of 
100 sample totals larger than the observed total—no indication of any problems, at 
least if we do not examine the specific values being generated. 
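
The posterior predictive check on the sample total can be sketched in a few lines. The following is a minimal illustration rather than the computation used for the example: it assumes a lognormal model with the standard noninformative prior on the log scale, and the function name, seed, and data are hypothetical.

```python
import math
import random


def ppc_sample_total_lognormal(y_obs, n_sims=100, seed=1):
    """Share of replicated sample totals that exceed the observed total."""
    rng = random.Random(seed)
    logs = [math.log(v) for v in y_obs]
    n = len(logs)
    mean_log = sum(logs) / n
    s2 = sum((v - mean_log) ** 2 for v in logs) / (n - 1)
    total_obs = sum(y_obs)

    n_larger = 0
    for _ in range(n_sims):
        # sigma^2 | y ~ Inv-chi^2(n - 1, s^2): rescale a chi^2_{n-1} draw.
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
        sigma2 = (n - 1) * s2 / chi2
        # mu | sigma^2, y ~ N(mean_log, sigma^2 / n).
        mu = rng.gauss(mean_log, math.sqrt(sigma2 / n))
        # Replicate a sample of the same size and add it up.
        sd = math.sqrt(sigma2)
        total_rep = sum(math.exp(rng.gauss(mu, sd)) for _ in range(n))
        if total_rep > total_obs:
            n_larger += 1
    return n_larger / n_sims


# Hypothetical, strongly right-skewed sizes (not the data of this example).
y_demo = [350, 600, 900, 1200, 1500, 2100, 2800, 4000, 9000, 250000]
share = ppc_sample_total_lognormal(y_demo)
```

A share near 0 or 1 signals that the model has trouble reproducing the observed sample total; as the text goes on to show, an unremarkable share does not by itself guarantee sensible inference for the population total.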

In this example we have the luxury of knowing the correct value (the actual total 
population of 13.8 million), and from this standpoint the inference for the population 
total under the power family turns out to be atrocious: for example, the median of the 
100 generated values of $y_{\text{total}}$ is $57 \times 10^7$, the 97th value is $14 \times 10^{15}$, and the largest
value generated is $12 \times 10^{17}$.


Need to specify crucial prior information. What is going on? How can the inferences
for the population total in sample 2 be so much less realistic with a better-fitting
model (that is, assuming a normal distribution for $y_i^{-1/4}$) than with a worse-fitting
model (that is, assuming a normal distribution for $\log y_i$)?
The problem with the inferences in this example is not an inability of the models to
fit the data, but an inherent inability of the data to distinguish between alternative
models that have different implications for estimation of the population total, $y_{\text{total}}$.
Estimates of $y_{\text{total}}$ depend strongly on the upper extreme of the distribution of
municipality sizes, but as we fit models like the power family, the right tail of these models
(especially beyond the 99.5% quantile) is being affected dramatically by changes
governed by the fit of the model to the main body of the data (between the 0.5% and
99.5% quantiles). The inference for $y_{\text{total}}$ is actually critically dependent upon tail
behavior beyond the quantile corresponding to the largest observed $y_{\text{obs}\,i}$. In order to
estimate the total (or the mean), not only do we need a model that reasonably fits the
observed data, but we also need a model that provides realistic extrapolations beyond
the region of the data. For such extrapolations, we must rely on prior assumptions,
such as specification of the largest possible size of a municipality.

More explicitly, for our two samples, the three parameters of the power family are
basically enough to provide a reasonable fit to the observed data. But in order to
obtain realistic inferences for the population of New York State from a simple random
sample of size 100, we must constrain the distribution of large municipalities. We were
warned, in fact, by the specific values of the posterior simulations for the sample total
from sample 2, where 10 of the 100 simulations for the replicated sample total were
larger than 300 million!

The substantive knowledge that is used to criticize the power-transformed normal
model can also be used to improve the model. Suppose we know that no single
municipality has population greater than $5 \times 10^6$. To include this information in the
model, we simply draw posterior simulations in the same way as before but truncate
municipality sizes to lie below that upper bound. The resulting posterior inferences for
total population size are reasonable. For both samples, the inferences for $y_{\text{total}}$ under
the power family are tighter than with the untruncated models and are realistic. The
95% intervals under samples 1 and 2 are $[6 \times 10^6, 20 \times 10^6]$ and $[10 \times 10^6, 34 \times 10^6]$,
respectively. Incidentally, the true population total is $13.7 \times 10^6$ (see Table 7.2), which
is included in both intervals.
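
The truncation step can be illustrated with simple rejection sampling. This is a sketch under stated assumptions, not the procedure behind the reported intervals: a lognormal model stands in for the power-transformed family, `mu` and `sigma` represent a single hypothetical posterior draw, and the observed values are made up. Only the population size of 804 municipalities and the bound of $5 \times 10^6$ come from the text.

```python
import math
import random


def draw_truncated_lognormal(mu, sigma, upper, rng):
    """Rejection sampling: redraw until the size falls below the bound."""
    while True:
        size = math.exp(rng.gauss(mu, sigma))
        if size <= upper:
            return size


def simulate_population_total(y_obs, n_pop, mu, sigma, upper, rng):
    """Observed total plus truncated draws for the unsampled units."""
    n_missing = n_pop - len(y_obs)
    unsampled = sum(draw_truncated_lognormal(mu, sigma, upper, rng)
                    for _ in range(n_missing))
    return sum(y_obs) + unsampled


rng = random.Random(0)
y_obs_demo = [500.0, 1200.0, 3000.0, 45000.0]  # hypothetical sample values
total = simulate_population_total(y_obs_demo, n_pop=804, mu=7.5, sigma=1.5,
                                  upper=5e6, rng=rng)
```

In a full analysis this draw would be repeated once for each posterior simulation of the model parameters, and the interval for the population total would be read off the resulting simulated totals.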


Why does the untransformed normal model work reasonably well for estimating the
population total? The inferences for $y_{\text{total}}$ based on the simple untransformed normal
model for $y_i$ are not terrible, even without supplying an upper bound for municipality
size. Why? The estimate for $y_{\text{total}}$ under the normal model is essentially based
only on the assumed normal sampling distribution for $\bar{y}_{\text{obs}}$ and the corresponding $\chi^2$
sampling distribution for $s^2_{\text{obs}}$. In order to believe that these sampling distributions
are approximately valid, we need the central limit theorem to apply, which we achieve
by implicitly bounding the upper tail of the distribution for $y_i$ enough to make
approximate normality work for a sample size of 100. This is not to suggest that we
recommend the untransformed normal model for clearly nonnormal data; in the example
considered here, the bounded power-transformed family makes more efficient use
of the data. In addition, the untransformed normal model gives extremely poor inferences
for estimands such as the population median. In general, a Bayesian analysis
that limits large values of $y_i$ must do so explicitly.


Well-designed samples or robust questions obviate the need for strong prior informa- 
tion. Extensive modeling and simulation are not needed to estimate totals routinely 
in practice. Good survey practitioners know that a simple random sample is not a 
good survey design for estimating the total in a highly skewed population. If stratifi- 
cation variables were available, one would prefer to oversample the large municipalities 
(for example, sample all five boroughs of New York City, a large proportion of cities, 
and a smaller proportion of towns). 
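
To make the design point concrete, here is a small sketch of the classical stratified estimate of a total, in which each stratum contributes its population count times its sample mean. It is a design-based point estimate shown only for illustration, not the Bayesian analysis of this chapter, and the strata and values are hypothetical.

```python
def stratified_total(strata):
    """Sum over strata of (units in stratum) * (stratum sample mean)."""
    total = 0.0
    for n_units, sample in strata:
        total += n_units * sum(sample) / len(sample)
    return total


strata_demo = [
    (2, [10.0, 20.0]),  # stratum sampled completely (both of its 2 units)
    (4, [1.0, 3.0]),    # stratum of 4 units, 2 of them sampled
]
estimate = stratified_total(strata_demo)
```

When the largest units are sampled with certainty, their contribution to the total is known exactly, so the estimate no longer depends on extrapolating the extreme right tail.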


Inference for the population median. It should not be overlooked, however, that the 
simple random samples we drew, although not ideal for estimating the population 
total, are satisfactory for answering many questions without imposing strong prior 
restrictions. 

For example, consider inference for the median size of the 804 municipalities. Using 
the data from sample 1, the simulated 95% posterior intervals for the median mu- 
nicipality size under the three models: (a) lognormal, (b) power-transformed normal 
family, and (c) power-transformed normal family truncated at $5 \times 10^6$, are [1800, 3000],
[1600, 2700], and [1600, 2700], respectively. The comparable intervals based on sample
2 are [1700, 3600], [1300, 2400], and [1200, 2400]. In general, better models tend to give
better answers, but for questions that are robust with respect to the data at hand, 
such as estimating the median from our simple random sample of size 100, the effect 
is rather weak. For such questions, prior constraints are not extremely critical and 
even relatively inflexible models can provide satisfactory answers. Moreover, the pos- 
terior predictive checks for the sample median looked fine—with the observed sample 
median near the middle of the distribution of simulated sample medians—for all these 
models (but not for the untransformed normal model). 


What general lessons have we learned from considering this example? The first two 
messages are specific to the example and address accuracy of inferences for covering the 
true population total. 


1. The lognormal model may yield inaccurate inferences for the population total even when 
it appears to fit observed data fairly well. 


2. Extending the lognormal family to a larger, and so better-fitting, model such as the power 
transformation family, may lead to less realistic inferences for the population total. 


These two points are not criticisms of the lognormal distribution or power transforma- 
tions. Rather, they provide warnings when using a model that has not been subjected 
to posterior predictive checks (for test variables relevant to the estimands of interest) and 
reality checks. In this context, the naive statement, ‘better fits to data mean better models 
which in turn mean better real-world answers,’ is not necessarily true. Statistical answers 
rely on prior assumptions as well as data, and better real-world answers generally require 
models that incorporate more realistic prior assumptions (such as bounds on municipality 
sizes) as well as provide better fits to data. This comment naturally leads to a general
message encompassing the first two points. 


3. In general, inferences may be sensitive to features of the underlying distribution of values 
in the population that cannot be addressed by the observed data. Consequently, for good 
statistical answers, we not only need models that fit observed data, but we also need: 


(a) flexibility in these models to allow specification of realistic underlying features not 
adequately addressed by observed data, such as behavior in the extreme tails of the 
distribution, or 

(b) questions that are robust for the type of data collected, in the sense that all relevant 
underlying features of population values are adequately addressed by the observed 
values. 


Finding models that satisfy (a) is a more general approach than finding questions that 
satisfy (b) because statisticians are often presented with hard questions that require answers 
of some sort, and do not have the luxury of posing easy (that is, robust) questions in their 
place. For example, for environmental reasons it may be important to estimate the total 
amount of pollutant being emitted by a manufacturing plant using samples of the soil from 
the surrounding geographical area, or, for purposes of budgeting a health-care insurance 
program, it may be necessary to estimate the total amount of medical expenses from a 
sample of patients. Such questions are inherently nonrobust in that their answers depend 
on the behavior in the extreme tails of the underlying distributions. Estimating more 
robust population characteristics, such as the median amount of pollutant in soil samples 
or the median medical expense for patients, does not address the essential questions in such 
examples. 

Relevant inferential tools, whether Bayesian or non-Bayesian, cannot be free of assump- 
tions. Robustness of Bayesian inference is a joint property of data, prior knowledge, and 
questions under consideration. For many problems, statisticians may be able to define the 
questions being studied so as to have robust answers. Sometimes, however, the practical, 
important question is inescapably nonrobust, with inferences being sensitive to assump- 
tions that the data at hand cannot address, and then a good Bayesian analysis expresses 
this sensitivity. 


7.7 Bibliographic note 


Some references to Bayesian approaches to cross-validation and predictive error include 
Geisser and Eddy (1979), Gelfand, Dey, and Chang (1992), Bernardo and Smith (1994), 
George and Foster (2000), and Vehtari and Lampinen (2002). Arlot and Celisse (2010)
provide a recent review of cross-validation in a generic (non-Bayesian) context. The first
order bias correction for cross-validation described in this chapter was proposed by Bur- 
man (1989); see also Tibshirani and Tibshirani (2009). Fushiki (2011) has proposed an 
alternative approach to compute bias correction for cross-validation. 

Geisser (1986) discusses predictive inference and model checking in general, Barbieri 
and Berger (2004) discuss Bayesian predictive model selection, and Vehtari and Ojanen 
(2012) present an extensive review of Bayesian predictive model assessment and selection 
methods, and of methods closely related to them. Piironen and Vehtari (2017) provide a 
complementary quantitative comparison of Bayesian predictive methods for model selection
and discuss the bias induced by model selection. Nelder and Wedderburn (1972) explore 
the deviance as a measure of model fit, Akaike (1973) introduces the expected predictive 
deviance and the AIC, and Mallows (1973) derives the related Cp measure. Hansen and 
Yu (2001) review related ideas from an information-theoretic perspective. Gneiting and 
Raftery (2007) review scoring rules for probabilistic prediction and Gneiting (2011) reviews 
scoring functions for point prediction. 
The deviance information criterion (DIC) and its calculation using posterior simulations
are described by Spiegelhalter et al. (2002); see also van der Linde (2005) and Plummer 
(2008). Burnham and Anderson (2002) discuss and motivate the use of the Kullback-Leibler 
divergence for model comparison, which relates to the log predictive density used to sum- 
marize predictive accuracy. The topic of counting parameters in nonlinear, constrained, and 
hierarchical models is discussed by Hastie and Tibshirani (1990), Moody (1992), Gelman, 
Meng, and Stern (1996), Hodges and Sargent (2001), and Vaida and Blanchard (2002). 
The last paper discusses the different ways that information criteria can be computed in 
hierarchical models. 

Our discussion of predictive information criteria and cross-validation is taken from Gel- 
man, Hwang, and Vehtari (2014). Vehtari, Gelman, and Gabry (2017) present efficient 
computation of LOO-CV using Pareto-smoothed importance sampling, and Vehtari et al. 
(2016) study efficient computation of LOO-CV for Gaussian latent variable models. Watanabe
(2010) presents WAIC. Watanabe (2013) also presents a widely applicable Bayesian
information criterion (WBIC), a version of BIC that also works in singular and unrealizable
cases. Singular learning theory used to derive WAIC and WBIC is presented in Watanabe 
(2009). 

Proofs for asymptotic equalities of various information criteria and LOO-CV have been 
shown by Stone (1977), Shibata (1989), and Watanabe (2010).

Ando and Tsay (2010) have proposed an information criterion for the joint prediction, 
but its bias correction has the same computational difficulties as many other extensions of 
AIC and it cannot be compared to cross-validation, since it is not possible to leave n data 
points out in the cross-validation approach. 

Vehtari and Ojanen (2012) discuss different prediction scenarios where the future explanatory
variable $\tilde{x}$ is assumed to be random, unknown, fixed, shifted, deterministic, or
constrained in some way. Here we discussed only scenarios with no $x$, with $p(\tilde{x})$ equal to $p(x)$,
or with $\tilde{x}$ equal to $x$. Variations of cross-validation and hold-out methods can be used for more
complex scenarios. 

Calibration of differences in log predictive densities is discussed, e.g., by McCulloch 
(1989). 

A comprehensive overview of the use of Bayes factors for comparing models and testing 
scientific hypotheses is given by Kass and Raftery (1995), which contains many further 
references in this area. Pauler, Wakefield, and Kass (1999) discuss Bayes factors for hi- 
erarchical models. Weiss (1996) considers the use of Bayes factors for sensitivity analysis. 
Chib (1995) and Chib and Jeliazkov (2001) describe approaches for calculating the marginal 
densities required for Bayes factors from iterative simulation output (as produced by the 
methods described in Chapter 11). 

Bayes factors are not defined for models with improper prior distributions, but there 
have been several attempts to define analogous quantities; see Spiegelhalter and Smith 
(1982) and Kass and Raftery (1995). A related proposal is to treat Bayes factors as posterior 
probabilities and then average over competing models—see Raftery (1996a) for a theoretical 
treatment, Rosenkranz and Raftery (1994) for an application, and Hoeting et al. (1999) and 
Chipman, George, and McCulloch (2001) for reviews. Carlin and Chib (1993) discuss the 
problem of averaging over models that have incompatible parameterizations. 

There are many examples of applied Bayesian analyses in which sensitivity to the model 
has been examined, for example Racine et al. (1986), Weiss (1994), and Smith, Spiegelhalter, 
and Thomas (1995). Calvin and Sedransk (1991) provide an example comparing various 
Bayesian and non-Bayesian methods of model checking and expansion. 

A variety of views on model selection and averaging appear in the articles by Draper 
(1995) and O’Hagan (1995) and the accompanying discussions. We refer the reader to these 
articles and their references for further discussion and examples of these methods. Because 
we emphasize continuous families of models rather than discrete choices, Bayes factors are 
rarely relevant in our approach to Bayesian statistics; see Raftery (1995) and Gelman and
Rubin (1995) for two contrasting views on this point. 
The final section of this chapter is an elaboration of Rubin (1983a). 


7.8 Exercises 


1. Predictive accuracy and cross-validation: Compute AIC, DIC, WAIC, and cross-validation 
for the logistic regression fit to the bioassay example of Section 3.7. 


2. Information criteria: show that DIC yields an estimate of elpd that is correct in expec- 
tation, in the case of normal models or in the asymptotic limit of large sample sizes (see 
Spiegelhalter et al., 2002, p. 604). 


3. Predictive accuracy for hierarchical models: Compute AIC, DIC, WAIC, and cross- 
validation for the meta-analysis example of Section 5.6. 


4. Bayes factors when the prior distribution is improper: on page 183, we discuss Bayes 
factors for comparing two extreme models for the SAT coaching example. 


(a) Derive the Bayes factor, $p(H_2|y)/p(H_1|y)$, as a function of $y_1, \ldots, y_J$, $\sigma_1, \ldots, \sigma_J$, and
$A$, for the models with $\text{N}(0, A^2)$ prior distributions.


(b) Evaluate the Bayes factor in the limit $A \to \infty$.

(c) For fixed $A$, evaluate the Bayes factor as the number of schools, $J$, increases. Assume
for simplicity that $\sigma_1 = \cdots = \sigma_J = \sigma$, and that the sample mean and variance of the
$y_j$'s do not change.


5. Power-transformed normal models: A natural expansion of the family of normal dis- 
tributions, for all-positive data, is through power transformations, which are used in 
various contexts, including regression models. For simplicity, consider univariate data 
$y = (y_1, \ldots, y_n)$, that we wish to model as independent and identically normally
distributed after transformation.

Box and Cox (1964) propose the model, $y_i^{(\phi)} \sim \text{N}(\mu, \sigma^2)$, where

$$
y_i^{(\phi)} = \begin{cases} (y_i^{\phi} - 1)/\phi & \text{for } \phi \neq 0 \\ \log y_i & \text{for } \phi = 0. \end{cases} \qquad (7.19)
$$

The parameterization in terms of $y_i^{(\phi)}$ allows a continuous family of power transformations
that includes the logarithm as a special case. To perform Bayesian inference, one must
set up a prior distribution for the parameters, $(\mu, \sigma, \phi)$.


(a) It seems natural to apply a prior distribution of the form $p(\mu, \log \sigma, \phi) \propto p(\phi)$, where
$p(\phi)$ is a prior distribution (perhaps uniform) on $\phi$ alone. Unfortunately, this prior
distribution leads to unreasonable results. Set up a numerical example to show why.
(Hint: consider what happens when all the data points $y_i$ are multiplied by a constant
factor.)

(b) Box and Cox (1964) propose a prior distribution that has the form $p(\mu, \sigma, \phi) \propto
\dot{y}^{1-\phi} p(\phi)$, where $\dot{y} = (\prod_{i=1}^{n} y_i)^{1/n}$. Show that this prior distribution eliminates the
problem in (a).

(c) Write the marginal posterior density, $p(\phi|y)$, for the model in (b).

(d) Discuss the implications of the fact that the prior distribution in (b) depends on the 
data. 


(e) The power transformation model is used with the understanding that negative values
of $y_i^{(\phi)}$ are not possible. Discuss the effect of the implicit truncation on the model.


See Pericchi (1981) and Hinkley and Runger (1984) for further discussion of Bayesian
analysis of power transformations.
County        Radon measurements (pCi/L)
Blue Earth    5.0, 13.0, 7.2, 6.8, 12.8, 5.8*, 9.5, 6.0, 3.8, 14.3*,
              1.8, 6.9, 4.7, 9.5
Clay          0.9*, 12.9, 2.6, 3.5*, 26.6, 1.5, 13.0, 8.8, 19.5, 2.5*,
              9.0, 13.1, 3.6, 6.9
Goodhue       14.3, 6.9*, 7.6, 9.8*, 2.6, 43.5, 4.9, 3.5, 4.8, 5.6,
              3.5, 3.9, 6.7


Table 7.3 Short-term measurements of radon concentration (in picoCuries/liter) in a sample of 
houses in three counties in Minnesota. All measurements were recorded on the basement level of 
the houses, except for those indicated with asterisks, which were recorded on the first floor. 


6. Fitting a power-transformed normal model: Table 7.3 gives short-term radon measure- 
ments for a sample of houses in three counties in Minnesota (see Section 9.4 for more 
on this example). For this problem, ignore the first-floor measurements (those indicated 
with asterisks in the table). 

(a) Fit the power-transformed normal model from Exercise 7.5(b) to the basement mea- 
surements in Blue Earth County. 

(b) Fit the power-transformed normal model to the basement measurements in all three 
counties, holding the parameter $\phi$ equal for all three counties but allowing the mean
and variance of the normal distribution to vary. 

(c) Check the fit of the model using posterior predictive simulations. 

(d) Discuss whether it would be appropriate to simply fit a lognormal model to these data. 

7. Model expansion: consider the $t$ model, $y_i|\mu, \sigma^2, \nu \sim t_\nu(\mu, \sigma^2)$, as a generalization of
the normal. Suppose that, conditional on $\nu$, you are willing to assign a noninformative
uniform prior density on $(\mu, \log \sigma)$. Construct what you consider a noninformative joint
prior density on $(\mu, \log \sigma, \nu)$, for the range $\nu \in [1, \infty)$. Address the issues raised in setting
up a prior distribution for the power-transformed normal model in Exercise 7.5. 
Chapter 8


Modeling accounting for data collection 


How does the design of a sample survey, an experiment, or an observational study affect 
the models that we construct for a Bayesian analysis? How does one analyze data from a 
survey that is not a simple random sample? How does one analyze data from hierarchical 
experimental designs such as randomized blocks and Latin squares, or nonrandomly gen- 
erated data in an observational study? If we know that a design is randomized, how does 
that affect our Bayesian inference? In this chapter, we address these questions by showing 
how relevant features of data collection are incorporated in the process of full probability 
modeling. 
This chapter has two general messages: 


• The information that describes how the data were collected should be included in the
analysis, typically by basing conclusions on a model (such as a regression; see Part IV)
that is conditional on the variables that describe the data collection process.


• If partial information is available (for example, knowing that a measurement exceeds
some threshold but not knowing its exact value, or having missing values for variables 
of interest), then a probability model should be used to relate the partially observed 
quantity or quantities to the other variables of interest. 


Despite the simplicity of our messages, many of the examples in this chapter are math- 
ematically elaborate, even when describing data collected under simple designs such as 
random sampling and randomized experiments. This chapter includes careful theoretical 
development of these simple examples in order to show clearly how general Bayesian model- 
ing principles adapt themselves to particular designs. The purpose of these examples is not 
to reproduce or refine existing classical methods but rather to show how Bayesian methods 
can be adapted to deal with data collection issues such as stratification and clustering in sur- 
veys, blocking in experiments, selection in observational studies, and partial-data patterns 
such as censoring. 


8.1 Bayesian inference requires a model for data collection 


There are a variety of settings where a data analysis must account for the design of data 
collection rather than simply model the observed values directly. These include classical 
statistical designs such as sample surveys and randomized experiments, problems of nonre- 
sponse or missing data, and studies involving censored or truncated data. Each of these, 
and other problems, will be considered in this chapter. 

The key goal throughout is generalizing beyond existing data to a larger population 
representing unrecorded data, additional units not in the study (as in a sample survey), or 
information not recorded on the existing units (as in an experiment or observational study, 
in which it is not possible to apply all treatments to all units). The information used in 
data collection must be included in the analysis, or else inferences will not necessarily be 
appropriate for the general population of interest. 


A naive student of Bayesian inference might claim that because all inference is conditional
on the observed data, it makes no difference how those data were collected. This
misplaced appeal to the likelihood principle would assert that given (1) a fixed model (in- 
cluding the prior distribution) for the underlying data and (2) fixed observed values of the 
data, Bayesian inference is determined regardless of the design for the collection of the data. 
Under this view there would be no formal role for randomization in either sample surveys 
or experiments. The essential flaw in the argument is that a complete definition of ‘the 
observed data’ should include information on how the observed values arose, and in many 
situations such information has a direct bearing on how these values should be interpreted. 
Formally then, the data analyst needs to incorporate the information describing the data 
collection process in the probability model used for analysis. 

The notion that the method of data collection is irrelevant to Bayesian analysis can be 
dispelled by the simplest of examples. Suppose for instance that we, the authors, give you, 
the reader, a collection of the outcomes of ten rolls of a die and all are 6’s. Certainly your 
attitude toward the nature of the die after analyzing these data would be different if we 
told you (i) these were the only rolls we performed, versus (ii) we rolled the die 60 times 
but decided to report only the 6’s, versus (iii) we decided in advance that we were going to 
report honestly that ten 6’s appeared but would conceal how many rolls it took, and we had 
to wait 500 rolls to attain that result. In simple situations such as these, it is easy to see 
that the observed data follow a different distribution from that for the underlying ‘complete 
data.’ Moreover, in such simple cases it is often easy to state immediately the marginal 
distribution of the observed data having properly averaged over the posterior uncertainty 
about the missing data. But in general such simplicity is not present. 
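
A small numerical sketch shows how the three reporting schemes change what the same report says about the die. The reading of each scheme below is an assumption made for illustration: (i) ten rolls, all sixes; (ii) ten sixes among 60 rolls; (iii) rolling until the tenth six, which arrives on roll 500. Here `theta` denotes the probability of a six, and the function and design names are hypothetical.

```python
import math


def log_lik(theta, design):
    """Log-likelihood of theta = Pr(six), up to a constant free of theta."""
    if design == "ten_rolls_all_sixes":        # (i) 10 rolls, 10 sixes
        return 10 * math.log(theta)
    if design == "ten_sixes_in_60_rolls":      # (ii) binomial, 10 of 60
        return 10 * math.log(theta) + 50 * math.log(1 - theta)
    if design == "tenth_six_on_roll_500":      # (iii) negative binomial
        return 10 * math.log(theta) + 490 * math.log(1 - theta)
    raise ValueError(design)


# Grid of candidate values for theta, excluding the endpoints 0 and 1.
grid = [k / 1000 for k in range(1, 1000)]


def best_theta(design):
    """Grid value of theta with the highest log-likelihood."""
    return max(grid, key=lambda t: log_lik(t, design))
```

Under (i) the likelihood keeps increasing toward `theta = 1`, under (ii) it peaks near 1/6, and under (iii) near 0.02, although the reported outcomes are identical in all three cases.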

More important than these theoretical examples, however, are the applications of the 
theory to analysis of data from surveys, experiments, and observational studies, and with 
different patterns of missing data. We shall discuss some general principles that guide the 
appropriate incorporation of study design and data collection in the process of data analysis: 


1. The data analyst should use all relevant information; the pattern of what has been 
observed can be informative. 


2. Ignorable designs (as defined in Section 8.2)—often based on randomization—are likely 
to produce data for which inferences are less sensitive to model choice, than nonignorable 
designs. 


3. As more explanatory variables are included in an analysis, the inferential conclusions 
become more valid conditionally but possibly more sensitive to the model specifications 
relating the outcomes to the explanatory variables. 


4. Thinking about design and the data one could have observed helps us structure inference 
about models and finite-population estimands such as the population mean in a sample 
survey or the average causal effect of an experimental treatment. In addition, the pos- 
terior predictive checks discussed in Chapter 6 in general explicitly depend on the study 
design through the hypothetical replications, $y^{\text{rep}}$.


Generality of the observed- and missing-data paradigm 


Our general framework for thinking about data collection problems is in terms of observed 
data that have been collected from a larger set of complete data (or potential data), leaving 
unobserved missing data. Inference is conditional on observed data and also on the pattern 
of observed and missing observations. We use the expression ‘missing data’ in a general 
sense to include unintentional missing data due to unfortunate circumstances such as survey 
nonresponse, censored measurements, and noncompliance in an experiment, but also inten- 
tional missing data such as data from units not sampled in a survey and the unobserved 
‘potential outcomes’ under treatments not applied in an experiment (see Table 8.1). 


This electronic edition is for non-commercial purposes only. 



Example                      ‘Observed data’                ‘Complete data’
Sampling                     Values from the n units        Values from all N units
                             in the sample                  in the population
Experiment                   Outcomes under the observed    Outcomes under all
                             treatment for each unit        treatments for all units
                             treated
Rounded data                 Rounded observations           Precise values of
                                                            all observations
Unintentional missing data   Observed data values           Complete data, both
                                                            observed and missing


Table 8.1: Use of observed- and missing-data terminology for various data structures. 


Section 8.2 defines a general notation for data collection and introduces the concept of 
ignorability, which is crucial in setting up models that correctly account for data collec- 
tion. The rest of this chapter discusses Bayesian analysis in different scenarios: sampling, 
experimentation, observational studies, and unintentional missing data. Chapter 18 goes 
into more detail on multivariate regression models for missing data. 

To develop a basic understanding of the fundamental issues, we need a general formal 
structure in which we can embed the variations as special cases. As we discuss in the next 
section, the key idea is to expand the sample space to include, in addition to the potential 
(complete) data y, an indicator variable I for whether each element of y is observed or not. 


8.2 Data-collection models and ignorability 


In this section, we develop a general notation for observed and potentially observed data. 
As noted earlier, we introduce the notation in the context of missing-data problems but 
apply it to sample surveys, designed experiments, and observational studies in the rest of 
the chapter. In a wide range of problems, it is useful to imagine what would be done if 
all data were completely observed—that is, if a sample survey were a census, or if all units 
could receive all treatments in an experiment, or if no observations were censored (see Table 
8.1). We divide the modeling tasks into two parts: modeling the complete data, y, typically 
using the methods discussed elsewhere in this book, and modeling the variable I, which
indexes which potential data are observed. 


Notation for observed and missing data 


Let y = (y_1, ..., y_N) be the matrix of potential data (each y_i may itself be a vector with 
components y_ij, j = 1, ..., J, for example if several questions are asked of each respondent 
in a survey), and let I = (I_1, ..., I_N) be a matrix of the same dimensions as y with indicators 
for the observation of y: I_ij = 1 means y_ij is observed, whereas I_ij = 0 means y_ij is missing. 
For notational convenience, let ‘obs’ = {(i, j): I_ij = 1} index the observed components of 
y and ‘mis’ = {(i, j): I_ij = 0} index the unobserved components of y; for simplicity we 
assume that I itself is always observed. (In a situation in which I is not fully observed, for 
example a sample survey with unknown population size, one can assign parameters to the 
unknown quantities so that I is fully observed, conditional on the unknown parameters.) 
The symbols y_obs and y_mis refer to the collection of elements of y that are observed and 
missing. The sample space is the product of the usual sample space for y and the sample 
space for I. Thus, in this chapter, and also in Chapter 18, we use the notation y_obs where 
in the other chapters we would use y. 
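
As a concrete illustration of this notation, the following minimal Python sketch builds a small potential-data matrix y, an inclusion matrix I, and the index sets ‘obs’ and ‘mis’. The numerical values, the use of NumPy, and the variable names are assumptions made for illustration and are not taken from the text.

```python
import numpy as np

# Hypothetical potential data: N = 4 respondents, J = 2 questions each.
y = np.array([[3.0, 1.0],
              [2.5, 0.0],
              [4.1, 2.2],
              [1.7, 0.9]])

# Inclusion indicators, same shape as y: I[i, j] = 1 means y[i, j] is observed.
I = np.array([[1, 1],
              [1, 0],
              [0, 0],
              [1, 1]])

# Index sets 'obs' and 'mis' as lists of (i, j) pairs (0-based, row-major order).
obs = [(int(i), int(j)) for i, j in np.argwhere(I == 1)]
mis = [(int(i), int(j)) for i, j in np.argwhere(I == 0)]

# y_obs is what the analyst actually sees; y_mis is never seen in practice
# (it is available here only because the example is made up).
y_obs = y[I == 1]
y_mis = y[I == 0]
```

Note that I is stored for every unit, including those whose y values are missing, which matches the assumption that I itself is always observed.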

For much of this chapter, we assume that the simple 0/1 indicator I_ij is adequate for 
summarizing the possible responses; Section 8.7 considers models in which missing data 
patterns are more complicated and the data points cannot simply be categorized as ‘observed’ 
or ‘missing.’ 


Stability assumption 


It is standard in statistical analysis to assume stability: that the recording or measurement 
process does not change the values of the data. In experiments, the assumption is called 
the stable unit treatment value assumption and includes the assumption of no interference 
between units: the treatment applied to any particular unit should have no effect on out- 
comes for the other units. More generally, the assumption is that the complete-data vector 
(or matrix) y is not affected by the inclusion vector (or matrix) I. An example in which 
this assumption fails is an agricultural experiment that tests several fertilizers on plots so 
closely spaced that the fertilizers leach to neighboring plots. 

In defining y as a fixed quantity, with I only affecting which elements of y are observed, 
our notation implicitly assumes stability. If instability is a possibility, then the notation 
must be expanded to allow all possible outcome vectors in a larger ‘complete data’ structure 
y. In order to control the computational burden, such a structure is typically created based 
on some specific model. We do not consider this topic further except in Exercise 8.4. 


Fully observed covariates 


In this chapter, we use the notation x for variables that are fully observed for all units. 
There are typically three reasons why we might want to include x in an analysis: 


1. We may be interested in some aspect of the joint distribution of (x,y), such as the 
regression of y on x. 


2. We may be interested in some aspect of the distribution of y, but x provides information 
about y: in a regression setting, if x is fully observed, then a model for p(y|x) can lead 
to more precise inference about new values of y than would be obtained by modeling y 
alone. 


3. Even if we are only interested in y, we must include x in the analysis if x is involved in the 
data collection mechanism or, equivalently, if the distribution of the inclusion indicators 
I depends on x. Examples include stratum indicators in sampling or block indicators in 
a randomized block experiment. We return to this topic several times in this chapter. 


Data model, inclusion model, and complete and observed data likelihood 


It is useful when considering data collection to break the joint probability model into two 
parts: (1) the model for the underlying complete data, y—including observed and unob- 
served components—and (2) the model for the inclusion vector, I. We define the complete- 
data likelihood as the product of the likelihoods of these two factors; that is, the distribution 
of the complete data, y, and the inclusion vector, I, given the parameters in the model: 


p(y, I | \theta, \phi) = p(y | \theta) \, p(I | y, \phi).    (8.1) 


In this chapter, we use θ and φ to denote the parameters of the distributions of the complete 
data and the inclusion vectors, respectively. 

In this formulation, the first factor of (8.1), p(y|θ), is a model of the underlying data 
without reference to the data collection process. For most problems we shall consider, 
the estimands of primary interest are functions of the complete data y (finite-population 
estimands) or of the parameters θ (superpopulation estimands). The parameters φ that 
index the missingness model are characteristic of the data collection but are not generally 
of scientific interest. It is possible, however, for θ and φ to be dependent in their prior 
distribution or even to be deterministically related. 

Expression (8.1) is useful for setting up a probability model, but it is not actually the 
‘likelihood’ of the data at hand unless y is completely observed. The actual information 
available is (y_obs, I), and so the appropriate likelihood for Bayesian inference is 


p(y_{obs}, I | \theta, \phi) = \int p(y, I | \theta, \phi) \, dy_{mis}, 


which we call the observed-data likelihood. If fully observed covariates x are available, all 
these expressions are conditional on x. 
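
To make the integral over y_mis concrete, here is a minimal Python sketch for one special case in which it has a closed form. It assumes, purely for illustration, that y_i ~ N(θ, 1) independently and that a value is recorded only if it falls at or below a known point c, so that I_i = 1 exactly when y_i ≤ c and there is no unknown φ. The model, the censoring point, the data, and the function names are all assumptions, not part of the text. Under these assumptions each missing unit contributes Pr(y_i > c | θ) to the observed-data likelihood.

```python
import math

def normal_logpdf(y, mean, sd=1.0):
    """Log density of N(mean, sd^2) at y."""
    z = (y - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)

def normal_cdf(y, mean, sd=1.0):
    """Cumulative distribution function of N(mean, sd^2) at y."""
    return 0.5 * (1.0 + math.erf((y - mean) / (sd * math.sqrt(2.0))))

def observed_data_loglik(theta, y_obs, n_mis, c):
    """log p(y_obs, I | theta) when values above the known point c are missing.

    Integrating each missing y_i over (c, infinity) gives Pr(y_i > c | theta).
    """
    ll = sum(normal_logpdf(yi, theta) for yi in y_obs)
    ll += n_mis * math.log(1.0 - normal_cdf(c, theta))
    return ll

def ignoring_inclusion_loglik(theta, y_obs):
    """log p(y_obs | theta): the likelihood that ignores the inclusion pattern I."""
    return sum(normal_logpdf(yi, theta) for yi in y_obs)

# Hypothetical data: three recorded values and two values censored above c = 2.
y_obs, n_mis, c = [0.3, 1.1, 1.8], 2, 2.0
grid = [k / 100.0 for k in range(-100, 401)]
theta_full = max(grid, key=lambda t: observed_data_loglik(t, y_obs, n_mis, c))
theta_ignore = max(grid, key=lambda t: ignoring_inclusion_loglik(t, y_obs))
```

In this sketch the likelihood that ignores I is maximized near the mean of the recorded values, whereas the observed-data likelihood is maximized at a larger θ, because the censored units carry the information that their values exceed c.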


Joint posterior distribution of parameters θ from the sampling model and φ from the 
missing-data model 


The complete-data likelihood of (y, I), given parameters (θ, φ) and covariates x, is 


p(y, I | x, \theta, \phi) = p(y | x, \theta) \, p(I | x, y, \phi), 


where the pattern of missing data can depend on the complete data y (both observed and 
missing), as in the cases of censoring and truncation presented in Section 8.7. The joint 
posterior distribution of the model parameters θ and φ, given the observed information, 
(x, y_obs, I), is 


p(\theta, \phi | x, y_{obs}, I) \propto p(\theta, \phi | x) \, p(y_{obs}, I | x, \theta, \phi) 
    = p(\theta, \phi | x) \int p(y, I | x, \theta, \phi) \, dy_{mis} 
    = p(\theta, \phi | x) \int p(y | x, \theta) \, p(I | x, y, \phi) \, dy_{mis}. 


The posterior distribution of θ alone is this expression averaged over φ: 


p(\theta | x, y_{obs}, I) \propto p(\theta | x) \int\int p(\phi | x, \theta) \, p(y | x, \theta) \, p(I | x, y, \phi) \, dy_{mis} \, d\phi.    (8.2) 


As usual, we will often avoid evaluating these integrals by simply drawing posterior 
simulations of the joint vector of unknowns, (y_mis, θ, φ), and then focusing on the estimands of 
interest. 


Finite-population and superpopulation inference 


As indicated earlier, we distinguish between two kinds of estimands: finite-population 
quantities and superpopulation quantities. In the terminology of the earlier chapters, 
finite-population estimands are unobserved but often observable, and so sometimes 
may be called predictable. It is usually convenient to divide our analysis and computation 
into two steps: superpopulation inference (that is, analysis of p(θ, φ | x, y_obs, I)) and 
finite-population inference using p(y_mis | x, y_obs, I, θ, φ). Posterior simulations of y_mis from 
its posterior distribution are called multiple imputations and are typically obtained by first 
drawing (θ, φ) from their joint posterior distribution and then drawing y_mis from its 
conditional posterior distribution given (θ, φ). Exercise 3.5 provides a simple example of these 
computations for inference from rounded data. 

If all units were observed fully, for example by sampling all units in a survey or by applying 
all treatments (counterfactually) to all units in an experiment, then finite-population 
quantities would be known exactly, but there would still be some uncertainty in superpopulation 
inferences. In situations in which a large fraction of potential observations are in fact 
observed, finite-population inferences about y_mis are more robust to model assumptions, 
such as additivity or linearity, than are superpopulation inferences about θ. In a model 
of y given x, the finite-population inferences depend only on the conditional distribution 
for the particular set of x’s in the population, whereas the superpopulation inferences are, 
implicitly, statements about the infinity of unobserved values of y generated by p(y|θ, x). 

Estimands defined from predictive distributions are of special interest because they are 
not tied to any particular parametric model and are therefore particularly amenable to 
sensitivity analysis across different models with different parameterizations. 


Posterior predictive distributions. When considering prediction of future data, or replicated 
data for model checking, it is useful to distinguish between predicting future complete 
data, y, and predicting future observed data, y_obs. The former task is, in principle, easier 
because it depends only on the complete data distribution, p(y|x, θ), and the posterior 
distribution of θ, whereas the latter task depends also on the data collection mechanism, 
that is, p(I|x, y, φ). 


Ignorability 


If we decide to ignore the data collection process, we can compute the posterior distribution 
of θ by conditioning only on y_obs but not I: 


p(\theta | x, y_{obs}) \propto p(\theta | x) \, p(y_{obs} | x, \theta) 
    = p(\theta | x) \int p(y | x, \theta) \, dy_{mis}.    (8.3) 


When the missing data pattern supplies no information, that is, when the function p(θ|x, y_obs) 
given by (8.3) equals p(θ|x, y_obs, I) given by (8.2), the study design or data collection 
mechanism is called ignorable (with respect to the proposed model). In this case, the 
posterior distribution of θ and the posterior predictive distribution of y_mis (for example, 
future values of y) are entirely determined by the specification of a data model, that is, 
p(y|x, θ)p(θ|x), and the observed values of y_obs. 


‘Missing at random’ and ‘distinct parameters’ 


Two general and simple conditions are sufficient to ensure ignorability of the missing data 
mechanism for Bayesian analysis. First, the condition of missing at random requires that 


p(I | x, y, \phi) = p(I | x, y_{obs}, \phi); 


that is, p(I|x, y, φ), evaluated at the observed value of (x, I, y_obs), must be free of y_mis; 
in other words, given observed x and y_obs, it is a function of φ alone. ‘Missing at random’ is a 
subtle term, since the required condition is that, given φ, missingness depends only on x 
and y_obs. For example, a deterministic inclusion rule that depends only on x is ‘missing at 
random’ under this definition. (An example of a deterministic inclusion rule is auditing all 
tax returns with declared income greater than $1 million, where ‘declared income’ is a fully 
observed covariate, and ‘audited income’ is y.) 

Second, the condition of distinct parameters is satisfied when the parameters of the 
missing data process are independent of the parameters of the data generating process in 
the prior distribution: 


p(\phi | x, \theta) = p(\phi | x). 


Models in which the parameters are not distinct are considered in an extended example in 
Section 8.7. 

The important consequence of these definitions is that, from (8.2), it follows that if data 
are missing at random according to a model with distinct parameters, then p(θ|x, y_obs) = 
p(θ|x, y_obs, I), that is, the missing data mechanism is ignorable. 
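
The step from (8.2) to this conclusion can be spelled out as follows; this is a sketch of the standard argument, written in LaTeX, using only the two conditions just defined. Under missing at random, p(I|x, y, φ) does not involve y_mis and so comes out of the integral over y_mis; under distinct parameters, p(φ|x, θ) reduces to p(φ|x). The remaining integral over φ depends only on (x, y_obs, I) and is therefore a constant as far as θ is concerned.

```latex
\begin{align*}
p(\theta \mid x, y_{\rm obs}, I)
  &\propto p(\theta \mid x) \int\!\!\int p(\phi \mid x, \theta)\,
     p(y \mid x, \theta)\, p(I \mid x, y, \phi)\, dy_{\rm mis}\, d\phi \\
  &= p(\theta \mid x) \left[ \int p(y \mid x, \theta)\, dy_{\rm mis} \right]
     \left[ \int p(\phi \mid x)\, p(I \mid x, y_{\rm obs}, \phi)\, d\phi \right] \\
  &\propto p(\theta \mid x)\, p(y_{\rm obs} \mid x, \theta)
   \;\propto\; p(\theta \mid x, y_{\rm obs}).
\end{align*}
```

If either condition fails, the second bracket generally depends on θ (through y_mis or through the prior link between θ and φ), and the factorization no longer holds.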


Ignorability and Bayesian inference under different data-collection schemes 


The concept of ignorability supplies some justification for a relatively weak version of the 
claim presented at the outset of this chapter that, with fixed data and fixed models for 
the data, the data collection process does not influence Bayesian inference. That result is 
true for all ignorable designs when ‘Bayesian inference’ is interpreted strictly to refer only 
to the posterior distribution of the estimands with one fixed model (both prior distribution 
and likelihood), conditional on the data, but excluding both sensitivity analyses and pos- 
terior predictive checks. Our notation also highlights the incorrectness of the claim for the 
irrelevance of study design in general: even with a fixed likelihood function p(y|θ), prior 
distribution p(θ), and data y, the posterior distribution does vary with different nonignorable 
data collection mechanisms. 

In addition to the ignorable/nonignorable classification for designs, we also distinguish 
between known and unknown mechanisms, as anticipated by the discussion in the exam- 
ples of Section 8.7 contrasting inference with specified versus unspecified truncation and 
censoring points. The term ‘known’ includes data collection processes that follow a known 
parametric family, even if the parameters φ are unknown. 


Designs that are ignorable and known with no covariates, including simple random sampling 
and completely randomized experiments. The simplest data collection procedures are those 
that are ignorable and known, in the sense that 


p(I | x, y, \phi) = p(I).    (8.4) 


Here, there is no unknown parameter φ because there is a single accepted specification 
for p(I) that does not depend on either x or y_obs. Only the complete-data distribution, 
p(y|x, θ), and the prior distribution for θ, p(θ), need be considered for inference. The 
obvious advantage of such an ignorable design is that the information from the data can be 
recovered with a relatively simple analysis. We begin Section 8.3 on surveys and Section 
8.4 on experiments by considering basic examples with no covariates, x, which document 
this connection with standard statistical practice and also show that the potential data 
y = (y_obs, y_mis) can usefully be regarded as having many components, most of which we 
never expect to observe but can be used to define ‘finite-population’ estimands. 


Designs that are ignorable and known given covariates, including stratified sampling and 
randomized block experiments. In practice, simple random sampling and complete randomization 
are less common than more complicated designs that base selection and treatment decisions 
on covariate values. It is not appropriate always to pretend that data y_obs1, ..., y_obsn 
are collected as a simple random sample from the target population, y_1, ..., y_N. A key idea 
for Bayesian inference with complicated designs is to include in the model p(y|x, θ) enough 
explanatory variables x so that the design is ignorable. With ignorable designs, many of 
the models presented in other chapters of this book can be directly applied. We illustrate 
this approach with several examples that show that not all known ignorable data collection 
mechanisms are equally good for all inferential purposes. 


Designs that are strongly ignorable and known. A design that is strongly ignorable (some- 
times called unconfounded) satisfies 


p(I | x, y, \phi) = p(I | x), 


so that the only dependence is on fully observed covariates. (For an example of a design that 
is ignorable but not strongly ignorable, consider a sequential experiment in which the 
unit-level probabilities of assignment depend on the observed outcomes of previously assigned 
units.) We discuss these designs further in the context of propensity scores below. 


Designs that are ignorable and unknown, such as experiments with nonrandom treatment 
assignments based on fully observed covariates. When analyzing data generated by an unknown 
or nonrandomized design, one can still ignore the assignment mechanism if p(I|x, y, φ) 
depends only on fully observed covariates, and θ is distinct from φ. The ignorable analysis 
must be conditional on these covariates. We illustrate with an example in Section 8.4, with 
strongly ignorable (given x) designs that are unknown. Propensity scores can play a useful 
role in the analysis of such data. 


Designs that are nonignorable and known, such as censoring. There are many settings in 
which the data collection mechanism is known (or assumed known) but is not ignorable. 
Two simple but important examples are censored data (some of the variations in Section 
8.7) and rounded data (see Exercises 3.5 and 8.14). 


Designs that are nonignorable and unknown. This is the most difficult case. For example, 
censoring at an unknown point typically implies great sensitivity to the model specification 
for y (discussed in Section 8.7). For another example, consider a medical study of several 
therapies, j = 1,...,J, in which the treatments that are expected to have smaller effects 
are applied to larger numbers of patients. The sample size of the experiment corresponding 
to treatment j, n_j, would then be expected to be correlated with the efficacy, θ_j, and 
so the parameters of the data collection mechanism are not distinct from the parameters 
indexing the data distribution. In this case, we should form a parametric model for the 
joint distribution of (n_j, y_j). Nonignorable and unknown data collection are standard in 
observational studies, as we discuss in Section 8.6, where initial analyses typically treat the 
design as ignorable but unknown. 


Propensity scores 


With a strongly ignorable design, the unit-level probabilities of assignments, 


Pr(I_i = 1 | X) = \pi_i, 


are called propensity scores. In some strongly ignorable designs, the vector of propensity 
scores, π = (π_1, ..., π_N), is an adequate summary of the covariates X in the sense that the 
assignment mechanism is strongly ignorable given just π rather than x. Conditioning on π 
rather than multivariate x may lose some precision but it can greatly simplify modeling and 
produce more robust inferences. Propensity scores also create another bridge to classical 
designs. 

However, propensity scores alone never supply enough information for posterior predic- 
tive replications (which are crucial to model checking, as discussed in Sections 6.3-6.4), 
because different designs can have the same propensity scores. For example, a completely 
randomized experiment with π_i = 1/2 for all units has the same propensity scores as an 
experiment with independent assignments with probability 1/2 for each treatment for each unit. 
Similarly, simple random sampling has the same propensity scores as some more elaborate 
equal-probability sampling designs involving stratified or cluster sampling. 
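
The following minimal Python sketch illustrates the point with two hypothetical assignment mechanisms for N = 10 units: a completely randomized design that assigns exactly five units to treatment, and independent coin flips with probability 1/2. Both give every unit a propensity score of 1/2, yet the replicated inclusion vectors they generate differ, since the number treated is fixed in one design and random in the other. The number of units, the seed, and the function names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_reps = 10, 10000

def completely_randomized(rng, N):
    """Assign exactly N/2 units to treatment, all such subsets equally likely."""
    I = np.zeros(N, dtype=int)
    I[rng.choice(N, size=N // 2, replace=False)] = 1
    return I

def independent_coin_flips(rng, N):
    """Assign each unit to treatment independently with probability 1/2."""
    return rng.binomial(1, 0.5, size=N)

reps_cr = np.array([completely_randomized(rng, N) for _ in range(n_reps)])
reps_cf = np.array([independent_coin_flips(rng, N) for _ in range(n_reps)])

# Estimated propensity scores: both are close to 1/2 for every unit.
pi_cr = reps_cr.mean(axis=0)
pi_cf = reps_cf.mean(axis=0)

# Number treated per replication: constant under complete randomization,
# Binomial(N, 1/2) under independent assignment.
sd_treated_cr = reps_cr.sum(axis=1).std()
sd_treated_cf = reps_cf.sum(axis=1).std()
```

A replicated dataset for model checking would therefore look different under the two designs even though the propensity scores agree.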


Unintentional missing data 


A ubiquitous problem in real datasets is unintentional missing data, which we discuss in 
more detail in Chapter 18. Classical examples include survey nonresponse, dropouts in 
experiments, and incomplete information in observational studies. When the amount of 
missing data is small, one can often perform a good analysis assuming that the missing 
data are ignorable (conditional on the fully observed covariates). As the fraction of missing 
information increases, the ignorability assumption becomes more critical. In some cases, 
half or more of the information is missing, and then a serious treatment of the missingness 
mechanism is essential. For example, in observational studies of causal effects, at most half 
of the potential data (corresponding to both treatments applied to all units in the study) 
are observed, and so it is necessary to either assume ignorability (conditional on available 
covariates) or explicitly model the treatment selection process. 


8.3 Sample surveys 
Simple random sampling of a finite population 


For perhaps the simplest nontrivial example of statistical inference, consider a finite 
population of N persons, where y_i is the weekly amount spent on food by the ith person. Let 
y = (y_1, ..., y_N), where the object of inference is average weekly spending on food in the 
population, ȳ. As usual, we consider the finite population as N exchangeable units; that 
is, we model the marginal distribution of y as an independent and identically distributed 
mixture over the prior distribution of underlying parameters θ: 


p(y) = \int \prod_{i=1}^{N} p(y_i | \theta) \, p(\theta) \, d\theta. 


The estimand, ȳ, will be estimated from a sample of y_i-values because a census of all 
N units is too expensive for practical purposes. A standard technique is to draw a simple 
random sample of specified size n. Let I = (I_1, ..., I_N) be the vector of indicators for 
whether or not person i is included in the sample: 


I_i = \begin{cases} 1 & \text{if } i \text{ is sampled} \\ 0 & \text{otherwise.} \end{cases} 


Formally, simple random sampling is defined by 


p(I | y, \phi) = p(I) = \begin{cases} \binom{N}{n}^{-1} & \text{if } \sum_{i=1}^{N} I_i = n \\ 0 & \text{otherwise.} \end{cases} 


This method is strongly ignorable and known (compare to equation (8.4)) and therefore is 
straightforward to deal with inferentially. The probability of inclusion in the sample (the 
propensity score) is π_i = n/N for all units. 


Bayesian inference for superpopulation and finite-population estimands. We can perform 
Bayesian inference applying the principles of the early chapters of this book to the posterior 
density, p(θ|y_obs, I), which under an ignorable design is p(θ|y_obs) ∝ p(θ)p(y_obs|θ). As usual, 
this requires setting up a model for the distribution of weekly spending on food in the 
population in terms of parameters θ. For this problem, however, the estimand of interest is 
the finite-population average, ȳ, which can be expressed as 


\bar{y} = \frac{n}{N} \bar{y}_{obs} + \frac{N-n}{N} \bar{y}_{mis},    (8.5) 


where ȳ_obs and ȳ_mis are the averages of the observed and missing y_i’s. 

We can determine the posterior distribution of ȳ using simulations of ȳ_mis from its 
posterior predictive distribution. We start with simulations of θ: θ^s, s = 1, ..., S. For each 
drawn θ^s we then draw a vector y_mis from 


p(y_{mis} | \theta^s, y_{obs}) = p(y_{mis} | \theta^s) = \prod_{i: I_i = 0} p(y_i | \theta^s), 


and then average the values of the simulated vector to obtain a draw of ȳ_mis from its 
posterior predictive distribution. Because ȳ_obs is known, we can compute draws from the 
posterior distribution of the finite-population mean, ȳ, using (8.5) and the draws of ȳ_mis. 
Although typically the estimand is viewed as ȳ, more generally it could be any function of 
y (such as the median of the y_i values, or the mean of log y_i). 
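
A minimal Python sketch of this simulation follows. It assumes, for illustration only, a normal model y_i ~ N(μ, σ²) with the standard noninformative prior p(μ, σ²) ∝ 1/σ², for which posterior draws of θ = (μ, σ²) are available in closed form; the population size, the simulated ‘observed’ sample, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical survey: N persons in the population, n sampled at random.
N, n, S = 1000, 100, 4000
y_obs = rng.normal(50.0, 15.0, size=n)   # stand-in for observed weekly spending
ybar_obs, s2_obs = y_obs.mean(), y_obs.var(ddof=1)

ybar_draws = np.empty(S)
for s in range(S):
    # Step 1: draw theta^s = (mu, sigma^2) from p(theta | y_obs) under the
    # assumed normal model with prior p(mu, sigma^2) proportional to 1/sigma^2.
    sigma2 = (n - 1) * s2_obs / rng.chisquare(n - 1)
    mu = rng.normal(ybar_obs, np.sqrt(sigma2 / n))
    # Step 2: draw the N - n missing values given theta^s and average them.
    y_mis = rng.normal(mu, np.sqrt(sigma2), size=N - n)
    # Step 3: combine with the known observed mean using (8.5).
    ybar_draws[s] = (n / N) * ybar_obs + ((N - n) / N) * y_mis.mean()

post_mean = ybar_draws.mean()
post_sd = ybar_draws.std(ddof=1)
# Normal-theory comparison value: sqrt((1/n - 1/N) * s2_obs).
approx_sd = np.sqrt((1.0 / n - 1.0 / N) * s2_obs)
```

To target a different estimand, such as the population median, one would stack y_obs with each simulated y_mis and compute that function of the completed vector instead of using (8.5).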


Large-sample equivalence of superpopulation and finite-population inference. If N − n is 
large, then we can use the central limit theorem to approximate the sampling distribution 
of ȳ_mis: 


p(\bar{y}_{mis} | \theta) \approx N( \bar{y}_{mis} | \mu, \tfrac{1}{N-n} \sigma^2 ), 


where μ = μ(θ) = E(y_i|θ) and σ² = σ²(θ) = var(y_i|θ). If n is large as well, then the posterior 
distributions of θ and any of its components, such as μ and σ, are approximately normal, 
hence the posterior distribution of ȳ_mis is approximately a normal mixture of normals and 
thus normal itself. More formally, 


p(\bar{y}_{mis} | y_{obs}) \approx \int p(\bar{y}_{mis} | \mu, \sigma) \, p(\mu, \sigma | y_{obs}) \, d\mu \, d\sigma; 


as both N and n get large with N/n fixed, this is approximately normal with 


E(\bar{y}_{mis} | y_{obs}) \approx E(\mu | y_{obs}) \approx \bar{y}_{obs}, 


and 


var(\bar{y}_{mis} | y_{obs}) \approx var(\mu | y_{obs}) + E( \tfrac{1}{N-n} \sigma^2 | y_{obs} ) 
    \approx \frac{1}{n} s_{obs}^2 + \frac{1}{N-n} s_{obs}^2 
    = \frac{N}{n(N-n)} s_{obs}^2, 


where s²_obs is the sample variance of the observed values of y_obs. Combining this approximate 
posterior distribution with (8.5), it follows not only that p(ȳ|y_obs) → p(μ|y_obs), but, more 
generally, that 


\bar{y} | y_{obs} \approx N( \bar{y}_{obs}, ( \tfrac{1}{n} - \tfrac{1}{N} ) s_{obs}^2 ).    (8.6) 


This is the formal Bayesian justification of normal-theory inference for finite sample surveys. 
For p(y_i|θ) normal, with the standard noninformative prior distribution, the exact result 
is ȳ|y_obs ~ t_{n−1}(ȳ_obs, (1/n − 1/N) s²_obs). (See Exercise 8.7.) 
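
For completeness, the algebra behind (8.6) can be sketched as follows (in LaTeX notation). Since ȳ_obs is known, (8.5) makes ȳ a linear function of ȳ_mis, so its approximate posterior mean and variance follow directly from those of ȳ_mis, whose approximate posterior variance is N s²_obs / (n(N − n)):

```latex
\begin{align*}
\mathrm{E}(\bar{y} \mid y_{\rm obs})
  &\approx \frac{n}{N}\,\bar{y}_{\rm obs} + \frac{N-n}{N}\,\bar{y}_{\rm obs}
   = \bar{y}_{\rm obs}, \\
\mathrm{var}(\bar{y} \mid y_{\rm obs})
  &= \left(\frac{N-n}{N}\right)^{2} \mathrm{var}(\bar{y}_{\rm mis} \mid y_{\rm obs})
   \approx \left(\frac{N-n}{N}\right)^{2} \frac{N}{n(N-n)}\, s_{\rm obs}^{2}
   = \frac{N-n}{nN}\, s_{\rm obs}^{2}
   = \left(\frac{1}{n} - \frac{1}{N}\right) s_{\rm obs}^{2}.
\end{align*}
```

As N grows with n fixed, the 1/N term vanishes and the variance tends to s²_obs/n, the approximate posterior variance of μ; when n = N the variance is zero, as it must be for a census.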


Stratified sampling 


In stratified random sampling, the N units are divided into J strata, and a simple random 
sample of size n_j is drawn from each stratum j = 1, ..., J. This design is ignorable given 
J vectors of indicator variables, x_1, ..., x_J, with x_j = (x_1j, ..., x_Nj) and 


x_{ij} = \begin{cases} 1 & \text{if unit } i \text{ is in stratum } j \\ 0 & \text{otherwise.} \end{cases} 


The variables x_j are effectively fully observed in the population as long as we know, for each 
j, the number of units N_j in the stratum and the values of x_ij for units in the sample. A 
natural analysis is to model the distributions of the measurements y_i within each stratum j 
in terms of parameters θ_j and then perform Bayesian inference on all the sets of parameters 
θ_1, ..., θ_J. For many applications it will be natural to assign a hierarchical model to the 
θ_j’s, yielding a problem with structure similar to the rat tumor experiments, the educational 
testing experiments, and the meta-analysis of Chapter 5. We illustrate this approach with 
an example at the end of this section. 


                     Proportion who prefer ...                   Sample 
Stratum, j           Bush,          Dukakis,       no opinion,    proportion, 
                     y_obs1j/n_j    y_obs2j/n_j    y_obs3j/n_j    n_j/n 
Northeast, I         0.30           0.62           0.08           0.032 
Northeast, II        0.50           0.48           0.02           0.032 
Northeast, III       0.47           0.41           0.12           0.115 
Northeast, IV        0.46           0.52           0.02           0.048 
Midwest, I           0.40           0.49           0.11           0.032 
Midwest, II          0.45           0.45           0.10           0.065 
Midwest, III         0.51           0.39           0.10           0.080 
Midwest, IV          0.55           0.34           0.11           0.100 
South, I             0.57           0.29           0.14           0.015 
South, II            0.47           0.41           0.12           0.066 
South, III           0.52           0.40           0.08           0.068 
South, IV            0.56           0.35           0.09           0.126 
West, I              0.50           0.47           0.03           0.023 
West, II             0.53           0.35           0.12           0.053 
West, III            0.54           0.37           0.09           0.086 
West, IV             0.56           0.36           0.08           0.057 


Table 8.2 Results of a CBS News survey of 1447 adults in the United States, divided into 16 
strata. The sampling is assumed to be proportional, so that the population proportions, N_j/N, are 
approximately equal to the sampling proportions, n_j/n. 

We obtain finite-population inferences by weighting the inferences from the separate 
strata in a way appropriate for the finite-population estimand. For example, we can write 
the population mean, ȳ, in terms of the individual stratum means, ȳ_j, as ȳ = Σ_{j=1}^{J} (N_j/N) ȳ_j. 
The finite-population quantities ȳ_j can be simulated given the simulated parameters θ_j 
for each stratum. There is no requirement that n_j/n = N_j/N; the finite-population Bayesian 
inference automatically corrects for any oversampling or undersampling of strata. 


Example. Stratified sampling in pre-election polling 

We illustrate the analysis of a stratified sample with the opinion poll introduced in 
Section 3.4, in which we estimated the proportion of registered voters who supported 
the two major candidates in the 1988 U.S. presidential election. In Section 3.4, we 
analyzed the data under the false assumption of simple random sampling. Actually, 
the CBS survey data were collected using a variant of stratified random sampling, 
in which all the primary sampling units (groups of residential phone numbers) were 
divided into 16 strata, cross-classified by region of the country (Northeast, Midwest, 
South, West) and density of residential area (as indexed by telephone exchanges). The 
data and relative size of each stratum in the sample are given in Table 8.2. For the 
purposes of this example, we assume that respondents are sampled at random within 
each stratum and that the sampling fractions are exactly equal across strata, so that 
N_j/N = n_j/n for each stratum j. 

Complications arise from several sources, including the systematic sampling used 
within strata, the selection of an individual to respond from each household that is 
contacted, the number of times people who are not at home are called back, the use of 
demographic adjustments, and the general problem of nonresponse. For simplicity, we 
ignore many complexities in the design and restrict our attention here to the Bayesian 
analysis of the population of respondents answering residential phone numbers, 
thereby assuming the nonresponse is ignorable and also avoiding the additional step 
of shifting to the population of registered voters. (Exercises 8.12 and 8.13 consider 
adjusting for the fact that the probability an individual is sampled is proportional to 
the number of telephone lines in his or her household and inversely proportional to 
the number of adults in the household.) A more complete analysis would control for 
covariates such as sex, age, and education that affect the probabilities of nonresponse. 


Figure 8.1 Values of Σ_{j=1}^{16} (N_j/N)(θ_1j − θ_2j) for 1000 simulations from the posterior distribution for 
the election polling example, based on (a) the simple nonhierarchical model and (b) the hierarchical 
model. Compare to Figure 3.2. 


Data distribution. To justify the use of an ignorable model, we must model the 
outcome variable conditional on the explanatory variables—region of the country and 
density of residential area—that determine the stratified sampling. We label the strata 
j = 1, ..., 16, with n_j out of N_j drawn from each stratum, and a total of N = Σ_{j=1}^{16} N_j 
registered voters in the population. We fit the multinomial model of Section 3.4 
within each stratum and a hierarchical model to link the parameters across different 
strata. For each stratum j, we label y_obsj = (y_obs1j, y_obs2j, y_obs3j), the number 
of supporters of Bush, Dukakis, and other/no-opinion in the sample, and we model 
y_obsj ~ Multin(n_j; θ_1j, θ_2j, θ_3j). 

A simple nonhierarchical model. The simplest analysis of these data assigns the 16 
vectors of parameters (θ_1j, θ_2j, θ_3j) independent prior distributions. At this point, the 
Dirichlet model is a convenient choice of prior distribution because it is conjugate 
to the multinomial likelihood (see Section 3.4). Assuming Dirichlet prior distributions 
with all parameters equal to 1, we can obtain posterior inferences separately 
for the parameters in each stratum. The resulting simulations of θ_ij’s constitute the 
‘superpopulation inference’ under this model. 


Finite-population inference. As discussed in the earlier presentation of this example 
in Section 3.4, an estimand of interest is the difference in the proportions of Bush and 
Dukakis supporters, that is, 


\sum_{j=1}^{16} \frac{N_j}{N} (\theta_{1j} - \theta_{2j}),    (8.7) 


which we can easily compute using the posterior simulations of θ and the known 
values of N_j/N. In the above formula, we have used superpopulation means θ_ij in 
place of population averages ȳ_ij, but given the huge size of the finite populations in 
question, this is not a concern. We are in a familiar setting in which finite-population 
and superpopulation inferences are essentially identical. The results of 1000 draws 
from the posterior distribution of (8.7) are displayed in Figure 8.1a. The distribution 
is centered at a slightly smaller value than in Figure 3.2, 0.097 versus 0.098, and is 
slightly more concentrated about the center. The former result is likely a consequence 
of the choice of prior distribution. The Dirichlet prior distribution can be thought of 
as adding a single voter to each multinomial category within each of the 16 strata, 
adding a total of 16 to each of the three response categories. This differs from the 
nonstratified analysis in which only a single voter was added to each of the three 
multinomial categories. A Dirichlet prior distribution with parameters equal to 1/16 
would reproduce the same median as the nonstratified analysis. The smaller spread 
is expected and is one of the reasons for taking account of the stratified design in the 
analysis. 
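
As a rough illustration of this computation, the Python sketch below simulates (8.7) under the nonhierarchical model. The raw counts are not given in the text, so the sketch reconstructs approximate (non-integer) counts from the rounded proportions in Table 8.2 and the total sample size of 1447; the reconstructed counts, the seed, and the variable names are assumptions, and the output should not be expected to reproduce Figure 8.1a exactly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Rounded values from Table 8.2: proportions for Bush, Dukakis, no opinion,
# and the sample proportion n_j/n, for the 16 strata in table order.
table = np.array([
    [0.30, 0.62, 0.08, 0.032], [0.50, 0.48, 0.02, 0.032],
    [0.47, 0.41, 0.12, 0.115], [0.46, 0.52, 0.02, 0.048],
    [0.40, 0.49, 0.11, 0.032], [0.45, 0.45, 0.10, 0.065],
    [0.51, 0.39, 0.10, 0.080], [0.55, 0.34, 0.11, 0.100],
    [0.57, 0.29, 0.14, 0.015], [0.47, 0.41, 0.12, 0.066],
    [0.52, 0.40, 0.08, 0.068], [0.56, 0.35, 0.09, 0.126],
    [0.50, 0.47, 0.03, 0.023], [0.53, 0.35, 0.12, 0.053],
    [0.54, 0.37, 0.09, 0.086], [0.56, 0.36, 0.08, 0.057],
])
props, sample_share = table[:, :3], table[:, 3]

# Assumption: approximate counts rebuilt from rounded proportions (n = 1447).
n_j = 1447 * sample_share
counts = props * n_j[:, None]

# Proportional sampling: take N_j/N equal to n_j/n, renormalized to sum to 1.
weights = sample_share / sample_share.sum()

# Independent Dirichlet(1, 1, 1) priors give Dirichlet(counts + 1) posteriors.
S = 1000
diff_draws = np.empty(S)
for s in range(S):
    theta = np.array([rng.dirichlet(c + 1.0) for c in counts])   # shape 16 x 3
    diff_draws[s] = np.sum(weights * (theta[:, 0] - theta[:, 1]))

post_median = np.median(diff_draws)
post_interval = np.percentile(diff_draws, [2.5, 97.5])
```

Replacing `c + 1.0` by `c + 1.0 / 16` gives the alternative Dirichlet prior with parameters 1/16 mentioned above.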


A hierarchical model. The simple model with prior independence across strata allows 
easy computation, but, as discussed in Chapter 5, when presented with a hierarchi- 
cal dataset, we can improve inference by estimating the population distributions of 
parameters using a hierarchical model. 

As an aid to constructing a reasonable model, we transform the parameters to separate 
partisan preference and probability of having a preference: 


\alpha_{1j} = \frac{\theta_{1j}}{\theta_{1j} + \theta_{2j}} = probability of preferring Bush, given that a preference is expressed 
\alpha_{2j} = 1 - \theta_{3j} = probability of expressing a preference.    (8.8) 


We then transform these parameters to the logit scale (because they are restricted to 
lie between 0 and 1), 


\beta_{1j} = logit(\alpha_{1j}) and \beta_{2j} = logit(\alpha_{2j}), 


and model them as exchangeable across strata with a bivariate normal distribution 
indexed by hyperparameters (μ_1, μ_2, τ_1, τ_2, ρ): 


p(\beta | \mu_1, \mu_2, \tau_1, \tau_2, \rho) = \prod_{j=1}^{16} N\left( \begin{pmatrix} \beta_{1j} \\ \beta_{2j} \end{pmatrix} \middle| \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \tau_1^2 & \rho\tau_1\tau_2 \\ \rho\tau_1\tau_2 & \tau_2^2 \end{pmatrix} \right), 


with the conditionals p(β_1j, β_2j | μ_1, μ_2, τ_1, τ_2, ρ) independent for j = 1, ..., 16. This 
model, which is exchangeable in the 16 strata, does not use all the available prior 
information, because the strata are actually structured in a 4 × 4 array, but it improves 
upon the nonhierarchical model (which is equivalent to the hierarchical model with 
the parameters τ_1 and τ_2 fixed at ∞). We complete the model by assigning a uniform 
prior density to the hierarchical mean and standard deviation parameters and to the 
correlation of the two logits. 
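
The reparameterization in (8.8) and the logit transform are easy to get wrong in code, so a small Python sketch may help. It maps one stratum's (θ_1j, θ_2j, θ_3j) to (β_1j, β_2j) and back; the function names are hypothetical, and the example values are simply the raw proportions for the first stratum in Table 8.2, used here as stand-ins for parameter values.

```python
import math

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(b):
    """Inverse of logit: maps a real number back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-b))

def theta_to_beta(theta1, theta2, theta3):
    """Map multinomial probabilities to the logit-scale parameters."""
    alpha1 = theta1 / (theta1 + theta2)   # Pr(prefers Bush | expresses a preference)
    alpha2 = 1.0 - theta3                 # Pr(expresses a preference)
    return logit(alpha1), logit(alpha2)

def beta_to_theta(beta1, beta2):
    """Invert the transformation: recover (theta1, theta2, theta3)."""
    alpha1, alpha2 = inv_logit(beta1), inv_logit(beta2)
    return alpha1 * alpha2, (1.0 - alpha1) * alpha2, 1.0 - alpha2

beta1, beta2 = theta_to_beta(0.30, 0.62, 0.08)
theta_back = beta_to_theta(beta1, beta2)
```

The inverse map is what a sampler working on the (β_1j, β_2j) scale would use to evaluate the multinomial likelihood for each stratum.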


Results under the hierarchical model. Posterior inference for the 37-dimensional pa- 
rameter vector (3, H1, 42,01, 02,T) is conceptually straightforward but requires com- 
putational methods beyond those developed in Parts I and II of this book. For the 
purposes of this example, we present the results of sampling using the Metropolis 
algorithm; see Exercise 11.7. 

Table 8.3 provides posterior quantities of interest for the hierarchical parameters and
for the parameters $\alpha_{1j}$, the proportion preferring Bush (among those who have a
preference) in each stratum. The posterior medians of the $\alpha_{1j}$'s vary from 0.48 to
0.59, representing considerable shrinkage from the proportions in the raw counts,
$y_{\text{obs}\,1j}/(y_{\text{obs}\,1j} + y_{\text{obs}\,2j})$, which vary from 0.33 to 0.67. The posterior median of $\rho$, the
between-stratum correlation of logit of support for Bush and logit of proportion who
express a preference, is negative, but the posterior distribution has substantial
variability. Posterior quantiles for the probability of expressing a preference are displayed
for only one stratum, stratum 16, as an illustration.

                                          Posterior quantiles
Estimand                          2.5%    25%     median  75%     97.5%
Stratum 1, $\alpha_{1,1}$         0.34    0.43    0.48    0.52    0.57
Stratum 2, $\alpha_{1,2}$         0.42    0.50    0.53    0.56    0.61
Stratum 3, $\alpha_{1,3}$         0.48    0.52    0.54    0.56    0.60
Stratum 4, $\alpha_{1,4}$         0.41    0.47    0.50    0.54    0.58
Stratum 5, $\alpha_{1,5}$         0.39    0.49    0.52    0.55    0.61
Stratum 6, $\alpha_{1,6}$         0.44    0.50    0.53    0.55    0.64
Stratum 7, $\alpha_{1,7}$         0.48    0.53    0.56    0.58    0.63
Stratum 8, $\alpha_{1,8}$         0.52    0.56    0.59    0.61    0.66
Stratum 9, $\alpha_{1,9}$         0.47    0.54    0.57    0.61    0.69
Stratum 10, $\alpha_{1,10}$       0.47    0.52    0.55    0.57    0.61
Stratum 11, $\alpha_{1,11}$       0.47    0.53    0.56    0.58    0.63
Stratum 12, $\alpha_{1,12}$       0.53    0.56    0.58    0.61    0.65
Stratum 13, $\alpha_{1,13}$       0.43    0.50    0.53    0.56    0.63
Stratum 14, $\alpha_{1,14}$       0.50    0.55    0.57    0.60    0.67
Stratum 15, $\alpha_{1,15}$       0.50    0.55    0.58    0.59    0.65
Stratum 16, $\alpha_{1,16}$       0.50    0.55    0.57    0.60    0.65
Stratum 16, $\alpha_{2,16}$       0.87    0.90    0.91    0.92    0.94
$\text{logit}^{-1}(\mu_1)$        0.50    0.53    0.55    0.56    0.59
$\text{logit}^{-1}(\mu_2)$        0.89    0.91    0.91    0.92    0.93
$\tau_1$                          0.11    0.17    0.23    0.30    0.47
$\tau_2$                          0.14    0.20    0.28    0.40    0.78
$\rho$                           -0.92   -0.71   -0.44    0.02    0.75


Table 8.3 Summary of posterior inference for the hierarchical analysis of the CBS survey in Table
8.2. The posterior distributions for the $\alpha_{1j}$'s vary from stratum to stratum much less than the raw
counts do. The inference for $\alpha_{2,16}$ for stratum 16 is included above as a representative of the 16
parameters $\alpha_{2j}$. The parameters $\mu_1$ and $\mu_2$ are transformed to the inverse-logit scale so they can
be more directly interpreted.


Posterior inference for the population total (8.7) is displayed on page 208 in Figure
8.1b which, compared to Figure 8.1a, indicates that the hierarchical model yields a
higher posterior median, 0.11, and slightly more variability. The higher median occurs 
because the groups within which support for Bush was lowest, strata 1 and 5, have 
relatively small samples (see Table 8.2) and so are pulled more toward the grand mean 
in the hierarchical analysis (see Table 8.3). 


Model checking. The large shrinkage of the extreme values in the hierarchical analysis
is a possible cause for concern, considering that the observed support for Bush in strata
1 and 5 was so much less than in the other 14 strata. Perhaps the normal model for the
distribution of true stratum parameters $\beta_{1j}$ is inappropriate? We check the fit using
a posterior predictive check, using as a test statistic $T(y) = \min_j y_{\text{obs}\,1j}/n_j$, which
has an observed value of 0.298 (occurring in stratum 1). Using 1000 draws from the
posterior predictive distribution, we find that $T(y^{\text{rep}})$ varies from 0.14 to 0.48, with
163 of the 1000 replicated values falling below 0.298. Thus, the extremely low value is plausible
under the model.
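
The mechanics of this check can be sketched in a few lines of Python. The sketch assumes that posterior draws of the first cell probability in each stratum are already available; here they are replaced by made-up stand-in draws, and the stratum sample sizes are also invented, so the resulting number has no connection to the 163/1000 reported above. The function names are hypothetical.

```python
import numpy as np

def posterior_predictive_min_check(theta1_draws, n, t_obs, rng):
    """Posterior predictive check for T(y) = min_j y_1j / n_j.

    theta1_draws: array (S, J), posterior draws of the probability of the
        first response category in each of J strata.
    n: array (J,), stratum sample sizes.
    Returns the fraction of replications with T(y_rep) below t_obs,
    together with the replicated test statistics.
    """
    # One replicated dataset per posterior draw; the count in a single
    # multinomial cell is binomial with that cell's probability.
    y_rep = rng.binomial(n, theta1_draws)   # shape (S, J)
    t_rep = np.min(y_rep / n, axis=1)       # T(y_rep) for each draw
    return np.mean(t_rep < t_obs), t_rep

rng = np.random.default_rng(0)
n = np.full(16, 90)                               # made-up sample sizes
theta1_draws = rng.beta(45, 45, size=(1000, 16))  # stand-in for posterior draws
p_value, t_rep = posterior_predictive_min_check(theta1_draws, n, 0.298, rng)
```

A tail fraction that is neither very close to 0 nor very close to 1 indicates that the observed minimum is the kind of value the model can generate.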


Cluster sampling 


In cluster sampling, the $N$ units in the population are divided into $K$ clusters, and sampling
proceeds in two stages. First, a sample of $J$ clusters is drawn, and second, a sample of $n_j$
units is drawn from the $N_j$ units within each sampled cluster $j = 1,\ldots,J$. This design is
ignorable given indicator variables for the $J$ clusters and knowledge of the number of units
in each of the $J$ clusters. Analysis of a cluster sample proceeds as for a stratified sample,
except that inference must be extended to the clusters not sampled, which correspond to
additional exchangeable batches in a hierarchical model (see Exercise 8.9).


Example. A survey of Australian schoolchildren 

We illustrate with a study estimating the proportion of children in the Melbourne area 
who walked to school. The survey was conducted by first sampling 72 schools from the 
metropolitan area and then surveying the children from two classes randomly selected 
within each school. The schools were selected from a list, with probability of selection 
proportional to the number of classes in the school. This ‘probability proportional to 
size’ sampling is a classical design for which the probabilities of selection are equal for 
each student in the city, and thus the average of the sample is an unbiased estimate 
of the population mean (see Section 4.5). The Bayesian analysis for this design is 
more complicated but has the advantage of being easily generalized to more elaborate 
inferential settings such as regressions. 


Notation for cluster sampling. We label students, classes, and schools as $i, j, k$,
respectively, with $N_{jk}$ students within class $j$ in school $k$, $M_k$ classes within school $k$,
and $K$ schools in the city. The number of students in school $k$ is then $N_k = \sum_{j=1}^{M_k} N_{jk}$,
with a total of $N = \sum_{k=1}^{K} N_k$ students in the city.

We define $y_{ijk}$ to equal 1 if student $i$ (in class $j$ and school $k$) walks to school and 0
otherwise, $\bar{y}_{.jk} = \frac{1}{N_{jk}}\sum_{i=1}^{N_{jk}} y_{ijk}$ to be the proportion of students in class $j$ and school
$k$ who walk to school, and $\bar{y}_{..k} = \frac{1}{N_k}\sum_{j=1}^{M_k} N_{jk}\bar{y}_{.jk}$ to be the proportion of students in
school $k$ who walk. The estimand of interest is $\bar{Y} = \frac{1}{N}\sum_{k=1}^{K} N_k\bar{y}_{..k}$, the proportion
of students in the city who walk.


The model. The general principle of modeling for cluster sampling is to include a 
parameter for each level of clustering. A simple model would start with independence 
at the student level within classes: $\Pr(y_{ijk} = 1) = \theta_{jk}$. Assuming a reasonable number
of students $N_{jk}$ in the class, we approximate the distribution of the average in class $j$
within school $k$ as independent with distributions,

$$\bar{y}_{.jk} \sim \text{N}(\theta_{jk}, \sigma^2_{jk}), \tag{8.9}$$

where $\sigma^2_{jk} = \bar{y}_{.jk}(1 - \bar{y}_{.jk})/N_{jk}$. We could perform the computations with the exact
binomial model, but for our purposes here, the normal approximation allows us to lay 
out more clearly the structure of the model for cluster data. 

We continue by modeling the classes within each school as independent (conditional 
on school-level parameters) with distributions, 


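$$\begin{aligned}
\theta_{jk} &\sim \text{N}(\theta_k, \tau^2_{\text{class}}) \\
\theta_k &\sim \text{N}(\alpha + \beta M_k, \tau^2_{\text{school}}).
\end{aligned} \tag{8.10}$$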


We include the number of classes $M_k$ in the school-level model because of the principle
that all information used in the design should be included in the analysis. In this
survey, the sampling probabilities depend on the school sizes $M_k$, and so the design is
ignorable only for a model that includes the $M_k$'s. The linear form (8.10) is only one
possible way to do this, but it seems a reasonable place to start.


Inference for the population mean. Bayesian inference for this problem proceeds in 

four steps: 

1. Fit the model defined by (8.9) and (8.10), along with a noninformative prior
distribution on the hyperparameters $\alpha, \beta, \tau_{\text{class}}, \tau_{\text{school}}$ to obtain inferences for these
hyperparameters, along with the parameters $\theta_{jk}$ for the classes $j$ in the sample and
$\theta_k$ for the schools $k$ in the sample.


2. For each posterior simulation from the model, use the model (8.10) to simulate the
parameters $\theta_{jk}$ and $\theta_k$ for classes and schools that are not in the sample.


3. Use the sampling model (8.9) to obtain inferences about students in unsampled
classes and unsampled schools, and then combine these to obtain inferences about
the average response in each school, $\bar{y}_{..k} = \frac{1}{N_k}\sum_{j=1}^{M_k} N_{jk}\bar{y}_{.jk}$. For unsampled schools
$k$, $\bar{y}_{..k}$ is calculated directly from the posterior simulations of its $M_k$ classes, or more
directly simulated as,

$$\bar{y}_{..k} \sim \text{N}(\alpha + \beta M_k,\ \tau^2_{\text{school}} + \tau^2_{\text{class}}/M_k + \sigma^2/N_k).$$

For sampled schools $k$, $\bar{y}_{..k}$ is an average over sampled and unsampled classes. This
can be viewed as a more elaborate version of the $y_{\text{obs}}, y_{\text{mis}}$ calculations described
at the beginning of Section 8.3.


4. Combine the posterior simulations about the school-level averages to obtain
inference about the population mean, $\bar{Y} = \frac{1}{N}\sum_{k=1}^{K} N_k\bar{y}_{..k}$. To perform this sum (as
well as the simulations in the previous step), we must know the number of children
$N_k$ within each of the schools in the population. If these are not known, they must
be estimated from the data—that is, we must jointly model $(N_k, \theta_k)$—as illustrated
in the Alcoholics Anonymous sampling example below.
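
Steps 2 to 4 can be sketched as follows for the simplified case in which every school is treated as unsampled, so that all quantities are simulated from the model. This is only an illustration under stated assumptions: the hyperparameter draws are made up rather than obtained from a fitted model (step 1 is not shown), the class sizes are invented, student-level outcomes are drawn from the exact binomial model rather than the normal approximation (8.9), and simulated class probabilities are clipped to [0, 1] because the normal model (8.10) does not respect that constraint. The function name is hypothetical.

```python
import numpy as np

def simulate_population_mean(alpha, beta, tau_school, tau_class, class_sizes, rng):
    """One draw of Y-bar, treating every school as unsampled.

    class_sizes: list of 1-D integer arrays; class_sizes[k][j] = N_jk.
    """
    total_walkers = 0
    total_students = 0
    for sizes in class_sizes:
        M_k = len(sizes)
        # School and class parameters from the hierarchical model (8.10).
        theta_k = rng.normal(alpha + beta * M_k, tau_school)
        theta_jk = rng.normal(theta_k, tau_class, size=M_k)
        # Assumption: clip so that the values are valid probabilities.
        p_jk = np.clip(theta_jk, 0.0, 1.0)
        # Number of walkers in each class (exact binomial, not (8.9)).
        walkers = rng.binomial(sizes, p_jk)
        total_walkers += walkers.sum()
        total_students += sizes.sum()
    # Equivalent to (1/N) * sum_k N_k * ybar_..k.
    return total_walkers / total_students

rng = np.random.default_rng(1)
# Invented population: 200 schools with 2 to 11 classes of 15 to 34 students.
class_sizes = [rng.integers(15, 35, size=m) for m in rng.integers(2, 12, size=200)]
# Made-up stand-ins for posterior draws of (alpha, beta, tau_school, tau_class).
hyper_draws = [(0.30, 0.005, 0.08, 0.05), (0.28, 0.006, 0.09, 0.04)]
Y_bar_draws = [simulate_population_mean(a, b, ts, tc, class_sizes, rng)
               for (a, b, ts, tc) in hyper_draws]
```

In a real analysis the sampled classes would contribute their observed counts, and only the unsampled classes and schools would be simulated.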


Approximate inference when the fraction of clusters sampled is small. For inference 
about schools in the Melbourne area, the superpopulation quantities $\theta_{jk}$ and $\theta_k$ can be
viewed as intermediate quantities, useful for the ultimate goal of estimating the finite- 
population average Y, as detailed above. In general practice, however (and including 
this particular survey), the number of schools sampled is a small fraction of the total 
in the city, and thus we can approximate the average of any variable in the population 
by its mean in the superpopulation distribution. In particular, we can approximate 
$\bar{Y}$, the overall proportion of students in the city who walk, by the expectations in the
school-level model:

$$\bar{Y} \approx \frac{1}{N}\sum_{k=1}^{K} N_k E(\theta_k) = \frac{1}{N}\sum_{k=1}^{K} N_k(\alpha + \beta M_k). \tag{8.11}$$


This quantity of interest has a characteristic feature of sample survey inference, that 
it depends both on model parameters and cluster sizes. Since we are now assuming 
that only a small fraction of schools are included in the sample, the calculation of 
(8.11) depends only on the hyperparameters $\alpha, \beta$ and the distribution of $(M_k, N_k)$ in
the population, and not on the school-level parameters $\theta_k$.
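
The approximation (8.11) is a linear function of the hyperparameters, so it can be evaluated for every posterior draw at once. The short sketch below assumes that arrays of posterior draws of $\alpha$ and $\beta$ and the population values of $(M_k, N_k)$ are available; the function name and the numbers in the usage line are hypothetical.

```python
import numpy as np

def approx_population_mean(alpha_draws, beta_draws, M, N):
    """Evaluate (8.11) for each posterior draw of (alpha, beta).

    M: number of classes in each school of the population.
    N: number of students in each school of the population.
    """
    alpha_draws = np.asarray(alpha_draws, dtype=float)
    beta_draws = np.asarray(beta_draws, dtype=float)
    M = np.asarray(M, dtype=float)
    N = np.asarray(N, dtype=float)
    weights = N / N.sum()
    # sum_k (N_k / N) (alpha + beta M_k) = alpha + beta * (student-weighted mean of M_k)
    return alpha_draws + beta_draws * np.sum(weights * M)

# Two schools with 2 and 4 classes and 50 and 150 students (made-up values).
example = approx_population_mean([0.2], [0.05], M=[2, 4], N=[50, 150])
```

The only population information entering the calculation is the student-weighted mean of $M_k$, which is why the school-level parameters drop out.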


Unequal probabilities of selection 


Another simple survey design is independent sampling with unequal sampling probabilities 
for different units. This design is ignorable conditional on the covariates, x, that deter- 
mine the sampling probability, as long as the covariates are fully observed in the general 
population (or, to put it another way, as long as we know the values of the covariates in 
the sample and also the distribution of x in the general population—for example, if the 
sampling probabilities depend on sex and age, the number of young females, young males, 
old females, and old males in the population). The critical step then in Bayesian modeling
is formulating the conditional distribution of $y$ given $x$.


Example. Sampling of Alcoholics Anonymous groups

Approximately every three years, Alcoholics Anonymous performs a survey of its mem- 
bers in the United States, first selecting a number of meeting groups at random, then 
sending a packet of survey forms to each group and asking all the persons who come 
to the meeting on a particular day to fill out the survey. For the 1997 survey, groups 
were sampled with equal probability, and 400 groups throughout the nation responded, 
sending back an average of 20 surveys each. 

For any binary survey response $y_{ij}$ (for example, the answer to the question, ‘Have
you been an AA member for more than 5 years?’), we assume independent responses
within each group $j$ with probability $\theta_j$ for a Yes response; thus,

$$y_j \sim \text{Bin}(n_j, \theta_j),$$

where $y_j$ is the number of Yes responses and $n_j$ the total number of people who
respond to the survey sent to members of group $j$. (This binomial model ignores the
finite-population correction that is appropriate if a large proportion of the potential 
respondents in a group actually do receive and respond to the survey. In practice, it 
is acceptable to ignore this correction because, in this example, the main source of 
uncertainty is with respect to the unsampled clusters—a more precise within-cluster 
model would not make much difference.) 

For any response $y$, interest lies in the population mean; that is,

$$\bar{Y} = \frac{\sum_j x_j\theta_j}{\sum_j x_j}, \tag{8.12}$$

where $x_j$ represents the number of persons in group $j$, defined as the average number
of persons who come to meetings of the group. To evaluate (8.12), we must perform
inferences about the group sizes $x_j$ and the probabilities $\theta_j$, for all the groups $j$ in the
population—the 400 sampled groups and the 6000 unsampled.

For Bayesian inference, we need a joint model for $(x_j, \theta_j)$. We set this up conditionally:
$p(x_j, \theta_j) = p(x_j)p(\theta_j|x_j)$. For simplicity, we assign a joint normal distribution, which
we can write as a normal model for $x_j$ and a linear regression for $\theta_j$ given $x_j$:

$$\begin{aligned}
x_j &\sim \text{N}(\mu_x, \tau_x^2) \\
\theta_j &\sim \text{N}(\alpha + \beta x_j, \tau_\theta^2).
\end{aligned} \tag{8.13}$$


This is not the best parameterization since it does not account for the constraint
that the probabilities $\theta_j$ must lie between 0 and 1 and the group sizes $x_j$ must be
positive. The normal model is convenient, however, for illustrating the mathematical
ideas behind cluster sampling, especially if the likelihood is simplified as,

$$y_j/n_j \approx \text{N}(\theta_j, \sigma_j^2), \tag{8.14}$$

where $\sigma_j^2 = y_j(n_j - y_j)/n_j^3$.

We complete the model with a likelihood on $n_j$, which we assume has a Poisson
distribution with mean $\pi x_j$, where $\pi$ represents the probability that a person will
respond to the survey. For convenience, we can approximate the Poisson by a normal
model:

$$n_j \approx \text{N}(\pi x_j, \pi x_j). \tag{8.15}$$

The likelihood (8.14) and (8.15), along with the prior distribution (8.13), and a
noninformative uniform distribution on the hyperparameters $\alpha, \beta, \tau_x, \tau_\theta, \pi$, represents a
hierarchical normal model for the parameters $(x_j, \theta_j)$—a bivariate version of the
hierarchical normal model discussed in Chapter 5. (It would require only a little more
effort to compute with the exact posterior distribution using the binomial and Poisson
likelihoods, with more realistic models for $x_j$ and $\theta_j$.)

In general, the population quantity (8.12) depends on the observed and unobserved 
groups. In this example, the number of unobserved clusters is large, and so we can 
approximate (8.12) by

$$\bar{Y} \approx \frac{E(x_j\theta_j)}{E(x_j)}.$$

From the normal model (8.13), we can express these expectations as posterior means
of functions of the hyperparameters:

$$\bar{Y} \approx \frac{\alpha\mu_x + \beta(\mu_x^2 + \tau_x^2)}{\mu_x}. \tag{8.16}$$

If average responses $\theta_j$ and group sizes $x_j$ are independent then $\beta = 0$ and so (8.16)
reduces to $\alpha$. Otherwise, the estimate is corrected to reflect the goal of estimating the
entire population, with each group counted in proportion to its size.
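
The expression (8.16) follows from $E(x_j\theta_j) = E(x_j(\alpha + \beta x_j)) = \alpha\mu_x + \beta(\mu_x^2 + \tau_x^2)$ and $E(x_j) = \mu_x$. As a quick numerical illustration, the sketch below evaluates the formula and compares it with a Monte Carlo ratio computed from groups simulated under (8.13). All numerical values and function names are made up for the illustration; in practice the formula would be evaluated at each posterior draw of the hyperparameters.

```python
import numpy as np

def size_weighted_mean(alpha, beta, mu_x, tau_x):
    """Evaluate (8.16): E(x * theta) / E(x) under the normal model (8.13)."""
    return (alpha * mu_x + beta * (mu_x**2 + tau_x**2)) / mu_x

def size_weighted_mean_mc(alpha, beta, mu_x, tau_x, tau_theta, n_groups, rng):
    """Monte Carlo version of the same ratio, simulating groups from (8.13)."""
    x = rng.normal(mu_x, tau_x, size=n_groups)
    theta = rng.normal(alpha + beta * x, tau_theta)
    return np.sum(x * theta) / np.sum(x)

rng = np.random.default_rng(0)
# Hypothetical hyperparameter values.
exact = size_weighted_mean(0.5, 0.01, 20.0, 5.0)
approx = size_weighted_mean_mc(0.5, 0.01, 20.0, 5.0, 0.1, 200_000, rng)
```

With a positive $\beta$, larger groups have higher response probabilities, so the size-weighted mean exceeds the unweighted mean of the $\theta_j$'s.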


In general, the analysis of sample surveys with unequal sampling probabilities can be 
difficult because of the large number of potential categories, each with its own model for 
the distribution of $y$ given $x$ and a parameter for the frequency of that category in the
population. At this point, hierarchical models may be needed to obtain good results. 


8.4 Designed experiments 


In statistical terminology, an experiment involves the assignment of treatments to units, 
with the assignment under the control of the experimenter. In an experiment, statistical 
inference is needed to generalize from the observed outcomes to the hypothetical outcomes 
that would have occurred if different treatments had been assigned. Typically, the analysis 
also should generalize to other units not studied, that is, thinking of the units included in 
the experiment as a sample from a larger population. We begin with a strongly ignorable 
design—the completely randomized experiment—and then consider more complicated ex- 
perimental designs that are ignorable only conditional on information used in the treatment 
assignment process. 


Completely randomized experiments 


Notation for complete data. Suppose, for notational simplicity, that the number of units $n$
in an experiment is even, with half to receive the basic treatment A and half to receive the
new active treatment B. We define the complete set of observables as $(y_i^A, y_i^B), i = 1,\ldots,n$,
where $y_i^A$ and $y_i^B$ are the outcomes if the $i$th unit received treatment A or B, respectively.
This model, which characterizes all potential outcomes as an $n \times 2$ matrix, requires the
stability assumption; that is, the treatment applied to unit $i$ is assumed to have no effect
on the potential outcomes in the other $n - 1$ units.


Causal effects in superpopulation and finite-population frameworks. The A versus B causal
effect for the $i$th unit is typically defined as $y_i^A - y_i^B$, and the overall causal estimand is
typically defined to be the true average of the causal effects. In the superpopulation
framework, the average causal effect is $E(y_i^A - y_i^B|\theta) = E(y_i^A|\theta) - E(y_i^B|\theta)$, where the expectations
average over the complete-data likelihood, $p(y_i^A, y_i^B|\theta)$. The average causal effect is thus a
function of $\theta$, with a posterior distribution induced by the posterior distribution of $\theta$.

The finite population causal effect is $\bar{y}^A - \bar{y}^B$, for the finite population under study. In
many experimental contexts, the superpopulation estimand is of primary interest, but it is
also nice to understand the connection to finite-population inference in sample surveys.

Inclusion model. The data collection indicator for an experiment is $I = ((I_i^A, I_i^B), i =
1,\ldots,n)$, where $I_i = (1,0)$ if the $i$th unit receives treatment A, $I_i = (0,1)$ if the $i$th unit
receives treatment B, and $I_i = (0,0)$ if the $i$th unit receives neither treatment (for example,
a unit whose treatment has not yet been applied). It is not possible for both $I_i^A$ and $I_i^B$ to
equal 1. For a completely randomized experiment,

$$p(I|y,\phi) = p(I) = \begin{cases} \binom{n}{n/2}^{-1} & \text{if } \sum_{i=1}^{n} I_i^A = \sum_{i=1}^{n} I_i^B = \frac{n}{2},\ I_i^A \neq I_i^B \text{ for all } i, \\ 0 & \text{otherwise.} \end{cases}$$

This treatment assignment is known and ignorable. The discussion of propensity scores on
page 204 applied to the situation where $I_i = (0,0)$ could not occur, and so the notation
becomes $I_i = I_i^A$ and $I_i^B = 1 - I_i^A$.
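
The assignment distribution for a completely randomized experiment is simple enough to write out directly. In the sketch below, which uses the reduced notation in which each unit receives either A or B, the function name and the 0/1 encoding are choices made for the illustration.

```python
from itertools import product
from math import comb

def assignment_probability(I_A):
    """p(I) under complete randomization with n/2 units per treatment.

    I_A: sequence of 0/1 indicators, 1 if the unit receives treatment A
         and 0 if it receives treatment B.
    """
    n = len(I_A)
    if n % 2 != 0:
        raise ValueError("this sketch assumes an even number of units")
    if sum(I_A) != n // 2:
        return 0.0  # unbalanced assignments cannot occur under this design
    return 1.0 / comb(n, n // 2)

# All 2^4 possible assignment vectors for n = 4 units.
all_assignments = list(product([0, 1], repeat=4))
total = sum(assignment_probability(a) for a in all_assignments)
```

The probability does not depend on the outcomes $y$ or on any unknown parameters, which is what makes the design known and ignorable.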

Bayesian inference for superpopulation and finite-population estimands. Just as with
simple random sampling, inference is straightforward under a completely randomized
experiment. Because the treatment assignment is ignorable, we can perform posterior inference
about the parameters $\theta$ using $p(\theta|y_{\text{obs}}) \propto p(\theta)p(y_{\text{obs}}|\theta)$. For example, under the usual
independent and identically distributed mixture model,

$$p(y_{\text{obs}}|\theta) = \prod_{i:\, I_i=(1,0)} p(y_i^A|\theta) \prod_{i:\, I_i=(0,1)} p(y_i^B|\theta),$$

which allows for a Bayesian analysis applied separately to the parameters governing the
marginal distribution of $y_i^A$, say $\theta_A$, and those of the marginal distribution of $y_i^B$, say $\theta_B$.
(See Exercise 3.3, for example.)

Once posterior simulations have been obtained for the superpopulation parameters, $\theta$,
one can obtain inference for the finite-population quantities $(y^A_{\text{mis}}, y^B_{\text{mis}})$ by drawing
simulations from the posterior predictive distribution, $p(y_{\text{mis}}|\theta, y_{\text{obs}})$. The finite-population
inference is trickier than in the sample survey example because only partial outcome
information is available on the treated units. Drawing simulations from the posterior predictive
distribution,

$$p(y_{\text{mis}}|\theta, y_{\text{obs}}) = \prod_{i:\, I_i=(1,0)} p(y_i^B|\theta, y_i^A) \prod_{i:\, I_i=(0,1)} p(y_i^A|\theta, y_i^B),$$

requires a model of the joint distribution, $p(y_i^A, y_i^B|\theta)$. (See Exercise 8.8.) The parameters
governing the joint distribution of $(y_i^A, y_i^B)$, say $\theta_{AB}$, do not appear in the likelihood.
The posterior distribution of $\theta_{AB}$ is found by averaging the conditional prior distribution,
$p(\theta_{AB}|\theta_A, \theta_B)$, over the posterior distribution of $\theta_A$ and $\theta_B$.


Large sample correspondence. Inferences for finite-population estimands such as $\bar{y}^A - \bar{y}^B$
are sensitive to aspects of the joint distribution of $y_i^A$ and $y_i^B$, such as $\text{corr}(y_i^A, y_i^B|\theta)$, for
which no data are available (in the usual experimental setting in which each unit receives
no more than one treatment). For large populations, however, the sensitivity vanishes if the
causal estimand can be expressed as a comparison involving only $\theta_A$ and $\theta_B$, the separate
parameters for the outcome under each treatment. For example, suppose the $n$ units are
themselves randomly sampled from a much larger finite population of $N$ units, and the
causal effect to be estimated is the mean difference between treatments for all $N$ units,
$\bar{y}^A - \bar{y}^B$. Then it can be shown (see Exercise 8.8), using the central limit theorem, that the
posterior distribution of the finite-population causal effect for large $n$ and $N/n$ is

$$(\bar{y}^A - \bar{y}^B)|y_{\text{obs}} \approx \text{N}\left(\bar{y}^A_{\text{obs}} - \bar{y}^B_{\text{obs}},\ \frac{2}{n}\left(s^{2A}_{\text{obs}} + s^{2B}_{\text{obs}}\right)\right), \tag{8.17}$$

where $s^{2A}_{\text{obs}}$ and $s^{2B}_{\text{obs}}$ are the sample variances of the observed outcomes under the two


treatments. The practical similarity between the Bayesian results and the repeated sampling

B: 257    E: 230    A: 279    C: 287    D: 202
D: 245    A: 283    E: 245    B: 280    C: 260
E: 182    B: 252    C: 280    D: 246    A: 250
A: 203    C: 204    D: 227    E: 193    B: 259
C: 231    D: 271    B: 266    A: 334    E: 338


Table 8.4 Yields of plots of millet arranged in a Latin square. Treatments A, B, C, D, E correspond 
to spacings of width 2, 4, 6, 8, 10 inches, respectively. Yields are in grams per inch of spacing. 
From Snedecor and Cochran (1989). 


randomization-based results is striking, but not entirely unexpected considering the relation 
between Bayesian and sampling-theory inferences in large samples, as discussed in Section 
4.4. 
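
The large-sample approximation (8.17) needs only the observed means and sample variances in the two treatment groups. A minimal sketch, assuming equal group sizes of $n/2$ as in the notation above and using made-up outcome values; the function name is hypothetical:

```python
import numpy as np

def finite_population_effect_approx(y_A_obs, y_B_obs):
    """Mean and standard deviation of the normal approximation (8.17).

    Assumes n/2 units were observed under each treatment.
    """
    y_A_obs = np.asarray(y_A_obs, dtype=float)
    y_B_obs = np.asarray(y_B_obs, dtype=float)
    if len(y_A_obs) != len(y_B_obs):
        raise ValueError("this sketch assumes equal group sizes")
    n = len(y_A_obs) + len(y_B_obs)
    mean = y_A_obs.mean() - y_B_obs.mean()
    # Sample variances use the n/2 - 1 denominator (ddof=1).
    var = (2.0 / n) * (y_A_obs.var(ddof=1) + y_B_obs.var(ddof=1))
    return mean, np.sqrt(var)

# Made-up outcomes for eight units, four per treatment.
effect_mean, effect_sd = finite_population_effect_approx([1, 2, 3, 4], [2, 4, 6, 8])
```

The variance is the familiar sum of the two squared standard errors, since each group mean is based on $n/2$ observations.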


Randomized blocks, Latin squares, etc. 


More complicated designs can be analyzed using the same principles, modeling the outcomes 
conditional on all factors that are used in determining treatment assignments. We present 
an example here of a Latin square; the exercises provide examples of other designs. 


Example. Latin square experiment 

Table 8.4 displays the results of an agricultural experiment with 25 plots (units) and 
five treatments labeled A, B, C, D, E. In our notation, the complete data y are a 25 x 5 
matrix with one entry observed in each row, and the indicator I is a fully observed 
25 x 5 matrix of zeros and ones. The estimands of interest are the average yields under 
each treatment. 


Setting up a model under which the design is ignorable. The factors relevant in the 
design (that is, affecting the probability distribution of I) are the physical locations 
of the 25 plots, which can be coded as a 25 x 2 matrix x of horizontal and vertical 
coordinates. Any ignorable model must be of the form $p(y|x, \theta)$. If additional relevant
information were available (for example, the location of a stream running through the 
field), it should be included in the analysis. 


Inference for estimands of interest under various models. Under a model for $p(y|x, \theta)$,
the design is ignorable, and so we can perform inference for $\theta$ based on the likelihood
of the observed data, $p(y_{\text{obs}}|x, \theta)$.

The starting point would be a linear additive model on the logarithms of the yields, 
with row effects, column effects, and treatment effects. A more complicated model 
has interactions between the treatment effects and the geographic coordinates; for 
example, the effect of treatment A might increase going from the left to the right 
side of the field. Such interactions can be included as additional terms in the additive 
model. The analysis is best summarized in two parts: (1) superpopulation inference for 
the parameters of the distribution of $y$ given the treatments and $x$, and (2) finite-population
inference obtained by averaging the distributions in the first step conditional
on the values of x in the 25 plots. 
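
As a starting point for the additive model on the log scale, the sketch below fits row, column, and treatment effects to the data of Table 8.4 by least squares. This gives point estimates only; it is not the full posterior analysis described above, although under a uniform prior distribution on the regression coefficients the posterior mean of the coefficients coincides with the least-squares estimate. The indicator coding (first row, first column, and treatment A as baselines) and all variable names are choices made for this illustration.

```python
import numpy as np

# Table 8.4: treatment labels and yields, by row and column of the field.
treatments = ["BEACD", "DAEBC", "EBCDA", "ACDEB", "CDBAE"]
yields = np.array([
    [257, 230, 279, 287, 202],
    [245, 283, 245, 280, 260],
    [182, 252, 280, 246, 250],
    [203, 204, 227, 193, 259],
    [231, 271, 266, 334, 338],
], dtype=float)

def design_matrix(treatments):
    """Intercept plus baseline-coded indicators for rows, columns, treatments."""
    rows = []
    for r in range(5):
        for c in range(5):
            t = "ABCDE".index(treatments[r][c])
            x = np.zeros(13)
            x[0] = 1.0              # intercept
            if r > 0:
                x[r] = 1.0          # row effects, columns 1-4 of X
            if c > 0:
                x[4 + c] = 1.0      # column effects, columns 5-8 of X
            if t > 0:
                x[8 + t] = 1.0      # treatment effects, columns 9-12 of X
            rows.append(x)
    return np.array(rows)

X = design_matrix(treatments)
log_y = np.log(yields).ravel()      # row-major order matches the loops above
coef, *_ = np.linalg.lstsq(X, log_y, rcond=None)

# Estimated effects of B, C, D, E relative to A, on the log scale.
treatment_effects = dict(zip("BCDE", coef[9:]))
```

Because each treatment appears exactly once in every row and column, each estimated treatment contrast equals the corresponding difference in mean log yields.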


Relevance of the Latin square design. Now suppose we are told that the data in Table 
8.4 actually arose from a completely randomized experiment that just happened to 
be balanced in rows and columns. How would this affect our analysis? Actually, 
our analysis should not be affected at all, as it is still desirable to model the plot 
yields in terms of the plot locations as well as the assigned treatments. Nevertheless, 
under a completely randomized design, the treatment assignment would be ignorable 
under a simpler model of the form $p(y|\theta)$, and so a Bayesian analysis ignoring the
plot locations would yield valid posterior inferences (without conditioning on plot
location). However, the analysis conditional on plot location is more relevant given 
what we know, and would tend to yield more precise inference assuming the true 
effects of location on plot yields are modeled appropriately. This point is explored 
more fully in Exercise 8.6. 


Sequential designs 


Consider a randomized experiment in which the probabilities of treatment assignment for 
unit $i$ depend on the results of the randomization or on outcomes for previously treated
units. Appropriate Bayesian analysis of sequential experiments is sometimes described as 
essentially impossible and sometimes described as trivial, in that the data can be analyzed 
as if the treatments were assigned completely at random. Neither of these claims is true. A 
randomized sequential design is ignorable conditional on all the variables used in determining 
the treatment allocations, including time of entry in the study and the outcomes of any 
previous units that are used in the design. See Exercise 8.15 for a simple example. A 
sequential design is not strongly ignorable. 


Including additional predictors beyond the minimally adequate summary 


From the design of a randomized study, we can determine the minimum set of explanatory 
variables required for ignorability; this minimal set, along with the treatment and outcome 
measurements, is called an adequate summary of the data, and the resulting inference is 
called a minimally adequate summary or simply a minimal analysis. As suggested earlier, 
sometimes the propensity scores can be such an adequate summary. In many examples, 
however, additional information is available that was not used in the design, and it is 
generally advisable to try to use all available information in a Bayesian analysis, thus going 
beyond the minimal analysis. 

For example, suppose a simple random sample of size 100 is drawn from a large pop- 
ulation that is known to be 51% female and 49% male, and the sex of each respondent is 
recorded along with the answer to the target question. A minimal summary of the data does 
not include sex of the respondents, but a better analysis models the responses conditional 
on sex and then obtains inferences for the general population by averaging the results for 
the two sexes, thus obtaining posterior inferences using the data as if they came from a 
stratified sample. (Posterior predictive checks for this problem could still be based on the 
simple random sampling design.) 
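
The stratified-style analysis of this example can be sketched as follows, assuming for illustration that the target question is binary and that the proportions answering Yes among women and men have independent uniform prior distributions. The counts in the usage lines are made up, and the function name is hypothetical; only the 51%/49% population split comes from the example.

```python
import numpy as np

def poststratified_draws(yes_f, n_f, yes_m, n_m, share_f, n_draws, rng):
    """Posterior draws of the population proportion, averaging over sex.

    Assumes independent Beta(1, 1) priors on the two within-sex proportions,
    so each posterior is Beta(yes + 1, no + 1).
    """
    theta_f = rng.beta(yes_f + 1, n_f - yes_f + 1, size=n_draws)
    theta_m = rng.beta(yes_m + 1, n_m - yes_m + 1, size=n_draws)
    # Weight by the known population shares, not the sample shares.
    return share_f * theta_f + (1.0 - share_f) * theta_m

rng = np.random.default_rng(0)
# Made-up sample of 100: 55 women (30 Yes) and 45 men (18 Yes).
draws = poststratified_draws(30, 55, 18, 45, share_f=0.51, n_draws=4000, rng=rng)
posterior_interval = np.percentile(draws, [2.5, 50, 97.5])
```

The sample happens to contain 55% women, but the population weights of 51% and 49% are what enter the estimate.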

On the other hand, if the population frequencies of males and females were not known, 
then sex would not be a fully observed covariate, and the frequencies of men and women 
would themselves have to be estimated in order to estimate the joint distribution of sex 
and the target question in the population. In that case, in principle the joint analysis could 
be informative for the purpose of estimating the distribution of the target variable, but in 
practice the adequate summary might be more appealing because it would not require the 
additional modeling effort involving the additional unknown parameters. See Exercise 8.6 
for further discussion of this point. 


Example. An experiment with treatment assignments based on observed 
covariates 

An experimenter can sometimes influence treatment assignments even in a randomized 
design. For example, an experiment was conducted on 50 cows to estimate the effect of 
a feed additive (methionine hydroxy analog) on six outcomes related to the amount of 
milk fat produced by each cow. Four diets (treatments) were considered, corresponding 
to different levels of the additive, and three variables were recorded before treatment 
assignment: lactation number (seasons of lactation), age, and initial weight of cow. 


This electronic edition is for non-commercial purposes only. 




Cows were initially assigned to treatments completely at random, and then the distri- 
butions of the three covariates were checked for balance across the treatment groups; 
several randomizations were tried, and the one that produced the ‘best’ balance with 
respect to the three covariates was chosen. The treatment assignment is ignorable 
(because it depends only on fully observed covariates and not on unrecorded variables 
such as the physical appearances of the cows or the times at which the cows entered 
the study) but unknown (because the decisions whether to re-randomize are not
explained). In our general notation, the covariates $x$ are a $50 \times 3$ matrix and the complete
data $y$ are a $50 \times 24$ matrix, with only one of 4 possible subvectors of dimension 6
observed for each unit (there were 6 different outcome measures relating to the cows’ 
diet after treatment). 

The minimal analysis uses a model of mean daily milk fat conditional on the treatment 
and the three pre-treatment variables. This analysis, based on ignorability, implicitly 
assumes distinct parameters; reasonable violations of this assumption should not have 
large effects on our inferences for this problem (see Exercise 8.3). A linear additive 
model seems reasonable, after appropriate transformations (see Exercise 14.5). As 
usual, one would first compute and draw samples from the posterior distribution of the 
superpopulation parameters—the regression coefficients and variance. In this example, 
there is probably no reason to compute inferences for finite-population estimands such 
as the average treatment effects, since there was no particular interest in the 50 cows 
that happened to be in the study. 

If the goal is more generally to understand the treatment effects, it would be better to 
model the multivariate outcome y—the six post-treatment measurements—conditional 
on the treatment and the three pre-treatment variables. After appropriate transfor- 
mations, a multivariate normal regression could make sense. 

The only issue we need to worry about is modeling $y$, unless $\phi$ is not distinct from $\theta$ (for
example, if the treatment assignment rule chosen is dependent on the experimenter’s 
belief about the treatment efficacy). 


For fixed model and data, the posterior distribution of $\theta$ and finite population estimands
is the same for all ignorable data collection models. However, better designs are likely to 
yield data exhibiting less sensitivity to variations in models. In the cow experiment, a 
better design would have been to explicitly balance over the covariates, most simply using 
a randomized block design. 


8.5 Sensitivity and the role of randomization 


We have seen how ignorable designs facilitate Bayesian analysis by allowing us to model 
observed data directly. To put it another way, posterior inference for $\theta$ and $y_{\text{mis}}$ is completely
insensitive to the details of an ignorable design, given a fixed model for the data. 


Complete randomization 


How does randomization fit into this picture? First, consider the situation with no fully 
observed covariates x, in which case the only way to have an ignorable design—that is, a 
probability distribution, $p(I_1,\ldots,I_n|\phi)$, that is invariant to permutations of the indexes—is
to randomize (excluding the degenerate designs in which all units get assigned one of the 
treatments). 

However, for any given inferential goal, some ignorable designs are better than others 
in the sense of being more likely to yield data that provide more precise inferences about 
estimands of interest. For example, for estimating the average treatment effect in a group of 
10 subjects with a noninformative prior distribution on the distribution of outcomes under
each of two treatments, the strategy of assigning a random half of the subjects to each
treatment is generally better than flipping a coin to assign a treatment to each subject 
independently, because the expected posterior precision of estimation is higher with equal 
numbers of subjects in each treatment group. 
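
To make this comparison concrete, the following sketch computes the expected posterior precision of the difference in means under each strategy. It assumes, beyond what is stated above, normal outcomes with a common known variance σ² and flat prior distributions on the two group means, so that the posterior variance of the difference is σ²(1/n_1 + 1/n_0). The function name and these modeling choices are ours, for illustration only.

```python
from math import comb

def expected_precision(n_units, p=0.5):
    """Expected precision (in units of 1/sigma^2) of the difference in means
    when each unit is assigned to treatment independently with probability p."""
    total = 0.0
    for n1 in range(n_units + 1):
        n0 = n_units - n1
        prob = comb(n_units, n1) * p**n1 * (1 - p) ** n0
        # Assumed model: posterior variance is sigma^2 * (1/n1 + 1/n0).
        # An empty treatment group carries no information, so precision is 0.
        precision = 0.0 if n1 == 0 or n0 == 0 else 1.0 / (1.0 / n1 + 1.0 / n0)
        total += prob * precision
    return total

balanced = 1.0 / (1.0 / 5 + 1.0 / 5)   # fixed split: 5 subjects per treatment
coin_flip = expected_precision(10)      # independent coin flip for each subject

print(balanced, coin_flip)
```

Under these assumptions the fixed half-and-half split gives precision 2.5/σ², while independent coin flips give 2.25/σ² on average, because unequal group sizes (including the rare case of an empty group) lower the precision.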


Randomization given covariates 


When fully observed covariates are available, randomized designs are in competition with 
deterministic ignorable designs, that is, designs with propensity scores equal to 0 or 1. 
What are the advantages, if any, of randomization in this setting, and how does knowledge 
of randomization affect Bayesian data analysis? Even in situations where little is known 
about the units, distinguishing unit-level information is usually available in some form, for 
example telephone numbers in a telephone survey or physical location in an agricultural 
experiment. For a simple example, consider a long stretch of field divided into 12 adjacent 
plots, on which the relative effects of two fertilizers, A and B, are to be estimated. Compare 
two designs: assignment of the six plots to each treatment at random, or the systematic 
design ABABABBABABA. Both designs are ignorable given x, the locations of the plots, 
and so a usual Bayesian analysis of y given x is appropriate. The randomized design is 
ignorable even not given x, but in this setting it would seem advisable, for the purpose 
of fitting an accurate model, to include at least a linear trend for E(y|x) no matter what 
design is used to collect these data. 

So are there any potential advantages to the randomized design? Suppose the random- 
ized design were used. Then an analysis that pretends x is unknown is still a valid Bayesian 
analysis. Suppose such an analysis is conducted, yielding a posterior distribution for y_mis, 
p(y_mis|y_obs). Now pretend x is suddenly observed; the posterior distribution can then be 
updated to produce p(y_mis|y_obs, x). (Since the design is ignorable, the inferences are also 
implicitly conditional on I.) Since both analyses are correct given their respective states 
of knowledge, we would expect them to be consistent with each other, with p(y_mis|y_obs, x) 
expected to be more precise as well as conditionally appropriate given x. If this is not the 
case, we should reconsider the modeling assumptions. This extra step of model examination 
is not available in the systematic design without explicitly averaging over a distribution of 
x. 

Another potential advantage of the randomized design is the increased flexibility for 
carrying out posterior predictive checks in hypothetical future replications. With the ran- 
domized design, future replications give different treatment assignments to the different 
plots. 

Finally, any particular systematic design is sensitive to associated particular model as- 
sumptions about y given x, and so repeated use of a single systematic design would cause a 
researcher’s inferences to be systematically dependent on a particular assumption. In this 
sense, there is a benefit to using different patterns of treatment assignment for different 
experiments; if nothing else about the experiments is specified, they are exchangeable, and 
the global treatment assignment is necessarily randomized over the set of experiments. 
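
As a small numerical illustration of this sensitivity for the 12-plot example, the sketch below checks how well the systematic design ABABABBABABA balances a linear and a quadratic trend in plot location, and compares the linear balance with all 924 equally likely complete randomizations. Coding the locations as x = 1, ..., 12 and the helper names are our assumptions.

```python
from itertools import combinations

positions = range(1, 13)          # assumed plot locations x = 1, ..., 12
systematic = "ABABABBABABA"

def mean_diff(a_plots, power):
    """Mean of x**power over the A plots minus the mean over the B plots."""
    a = [x**power for x in positions if x in a_plots]
    b = [x**power for x in positions if x not in a_plots]
    return sum(a) / len(a) - sum(b) / len(b)

sys_a = {x for x, t in zip(positions, systematic) if t == "A"}
sys_linear = mean_diff(sys_a, 1)      # imbalance in x
sys_quadratic = mean_diff(sys_a, 2)   # imbalance in x^2

# Every way of giving A to six of the twelve plots (complete randomization)
random_linear = [mean_diff(set(c), 1) for c in combinations(positions, 6)]
avg_linear = sum(random_linear) / len(random_linear)
rms_linear = (sum(d * d for d in random_linear) / len(random_linear)) ** 0.5

print(sys_linear, sys_quadratic, avg_linear, rms_linear)
```

With this coding, the systematic design balances the linear term exactly but leaves a difference of 6 in the mean of x², so its inferences lean on whatever is assumed about curvature in E(y|x). The randomized design is balanced on average in the linear term, with a root-mean-square imbalance of about 2.1 in any single experiment.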


Designs that ‘cheat’ 


Another advantage of randomization is to make it more difficult for an experimenter to 
‘cheat,’ intentionally or unintentionally, by choosing sample selections or treatment 
assignments in such a way as to bias the results (for example, assigning treatments A and B to 
sunny and shaded plots, respectively). This complication can enter a Bayesian analysis in 
several ways: 

1. If treatment assignment depends on unrecorded covariates (for example, an indicator 
for whether each plot is sunny or shaded), then it is not ignorable, and the resulting 
unknown selection effects must be modeled, at best a difficult enterprise with heightened 
sensitivity to model assumptions. 


2. If the assignment depends on recorded covariates, then the dependence of the outcome 
variable on the covariates, p(y|x, θ), must be modeled. Depending on the actual pattern of 
treatment assignments, the resulting inferences may be highly sensitive to the model; for 
example, if all sunny plots get A and all shaded plots get B, then the treatment indicator 
and the sunny/shaded indicator are identical, so the observed-data likelihood provides 
no information to distinguish between them. 


3. Even a randomized design can be nonignorable if the parameters are not distinct. For 
example, consider an experimenter who uses complete randomization when he thinks 
that treatment effects are large but uses randomized blocks for increased precision when 
he suspects smaller effects. In this case, the assignment mechanism depends on the prior 
distribution of the treatment effects; in our general notation, φ and θ are dependent in 
their prior distribution, and we no longer have ignorability. In practice, one can often 
ignore such effects if the data dominate the prior distribution, but it is theoretically 
important to see how they fit into the Bayesian framework. 


Bayesian analysis of nonrandomized studies 


Randomization is a method of ensuring ignorability and thus making Bayesian inferences less 
sensitive to assumptions. Consider the following nonrandomized sampling scheme: in order 
to estimate the proportion of the adults in a city who hold a certain opinion, an interviewer 
stands on a street corner and interviews all the adults who pass by between 11 am and noon 
on a certain day. This design can be modeled in two ways: (1) nonignorable because the 
probability that adult i is included in the sample depends on that person’s travel pattern, a 
variable that is not observed for the N − n adults not in the sample; or (2) ignorable because 
the probability of inclusion in the sample depends on a fully observed indicator variable, x_i, 
which equals 1 if adult i passed by in that hour and 0 otherwise. Under the nonignorable 
parameterization, we must specify a model for I given y and include that factor in the 
likelihood. Under the ignorable model, we must perform inference for the distribution of 
y given x, with no data available when x = 0. In either case, posterior inference for the 
estimand of interest, ȳ, is highly sensitive to the prior distribution unless n/N is close to 1. 
In contrast, if covariates x are available and a nonrandomized but ignorable and roughly 
balanced design is used, then inferences are typically less sensitive to prior assumptions 
about the design mechanism. Any strongly ignorable design is implicitly randomized over 
all variables not in x, in the sense that if two units i and j have identical values for all 
covariates x, then their propensity scores are equal: p(I_i|x) = p(I_j|x). Our usual goal 
in randomized or nonrandomized studies is to set up an ignorable model so we can use 
the standard methods of Bayesian inference developed in most of this book, working with 


$$p(\theta \mid x, y_{\rm obs}) \propto p(\theta)\, p(x, y_{\rm obs} \mid \theta).$$
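
To see why the street-corner survey is so sensitive unless n/N is close to 1, it can help to write the population proportion as ȳ = (n ȳ_obs + (N − n) ȳ_mis)/N and ask what range of ȳ is logically possible when nothing is assumed about the unsampled adults. The sketch below does only this bookkeeping; it is not a Bayesian analysis, and the city size, sample size, and observed proportion are hypothetical numbers chosen for illustration.

```python
def population_bounds(n, N, ybar_obs):
    """Range of the population proportion consistent with the observed data alone.

    Uses ybar = (n * ybar_obs + (N - n) * ybar_mis) / N, where the proportion
    among the unsampled adults, ybar_mis, can lie anywhere in [0, 1].
    """
    lower = n * ybar_obs / N
    upper = (n * ybar_obs + (N - n)) / N
    return lower, upper

# Hypothetical: 200 passers-by interviewed in a city of 100,000 adults
print(population_bounds(200, 100_000, 0.6))

# Hypothetical: the same observed proportion when 95% of adults are interviewed
print(population_bounds(95_000, 100_000, 0.6))
```

The width of the interval is 1 − n/N, so with a small sampling fraction essentially everything about ȳ must come from the model or prior distribution for the unobserved units.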


8.6 Observational studies 


Comparison to experiments 


In an observational or nonexperimental study, data are typically analyzed as if they came 
from an experiment—that is, with a treatment and an outcome recorded for each unit—but 
the treatments are simply observed and not under the experimenter’s control. For example, 
the SAT coaching study presented in Section 5.5 involves experimental data, because the 
students were assigned to the two treatments by the experimenter in each school. In that 
case, the treatments were assigned randomly; in general, such studies are often, but not 
necessarily, randomized. The data would have arisen from an observational study if, for 
example, the students themselves had chosen whether or not to enroll in the coaching 
programs. 

Figure 8.2 Hypothetical-data illustrations of sensitivity analysis for observational studies. In each 
graph, circles and dots represent treated and control units, respectively. (a) The first plot shows 
balanced data, as from a randomized experiment, and the difference between the two lines shows the 
estimated treatment effect from a simple linear regression. (b, c) The second and third plots show 
unbalanced data, as from a poorly conducted observational study, with two different models fit to 
the data. The estimated treatment effect for the unbalanced data in (b) and (c) is highly sensitive 
to the form of the fitted model, even when the treatment assignment is ignorable. 

In a randomized experiment, the groups receiving each treatment will be similar, on 
average, or with differences that are known ahead of time by the experimenter (if a design 
is used that involves unequal probabilities of assignment). In an observational study, in 
contrast, the groups receiving each treatment can differ greatly, and by factors not measured 
in the study. 

A well-conducted observational study can provide a good basis for drawing causal in- 
ferences, provided that (1) it controls well for background differences between the units 
exposed to the different treatments; (2) enough independent units are exposed to each 
treatment to provide reliable results (that is, narrow-enough posterior intervals); (3) the 
study is designed without reference to the outcome of the analysis; (4) attrition, dropout, 
and other forms of unintentional missing data are minimized or else appropriately modeled; 
and (5) the analysis takes account of the information used in the design. 

Many observational studies do not satisfy these criteria. In particular, systematic pre- 
treatment differences between groups should be included in the analysis, using background 
information on the units or with a realistic nonignorable model. Minor differences between 
different treatment groups can be controlled, at least to some extent, using models such as 
regression, but with larger differences, posterior inferences become highly sensitive to the 
functional form and other details used in the adjustment model. In some cases, the use of 
estimated propensity scores can help in limiting the sensitivity of such analyses by restricting 
to a subset of treated and control units with similar distributions of the covariates. In this 
context, the propensity score is the probability (as a function of the covariates) that a unit 
receives the treatment. 

Figure 8.2 illustrates the connection between lack of balance in data collection and 
sensitivity to modeling assumptions in the case with a single continuous covariate. In the 
first of the three graphs, which could have come from a randomized experiment, the two 
groups are similar with respect to the pre-treatment covariate. As a result, the estimated 
treatment effect—the difference between the two regression lines in an additive model—is 
relatively insensitive to the form of the model fit to y|x. The second and third graphs 
show data that could arise from a poorly balanced observational study (for example, an 
economic study in which a certain training program is taken only by people who already 
had relatively higher incomes). From the second graph, we see that a linear regression yields 
a positive estimated treatment effect. However, the third graph shows that the identical 
data could be fitted just as well with a mildly nonlinear relation between y and x, without 
any treatment effect. In this hypothetical scenario, estimated treatment effects from the 
observational study are extremely sensitive to the form of the fitted model. 

Another way to describe this sensitivity is by using estimated propensity scores. Suppose 
we estimate the propensity score, Pr(I_i = 1|X_i), assuming strongly ignorable treatment 
assignment, for example using a logistic regression model (see Section 3.7 and Chapter 16). 
In the case of Figure 8.2b,c, there will be little or no overlap in the estimated propensity 
scores in the two treatment conditions. This technique works even with multivariate X 
because the propensity score is a scalar and so can be extremely useful in observational 
studies with many covariates to diagnose and reveal potential sensitivity to models for 
causal effects. 


Bayesian inference for observational studies 


In Bayesian analysis of observational studies, it is typically important to gather many co- 
variates so that the treatment assignment is close to ignorable conditional on the covariates. 
Once ignorability is accepted, the observational study can be analyzed as if it were an exper- 
iment with treatment assignment probabilities depending on the included covariates. We 
shall illustrate this approach in Section 14.3 for the example of estimating the advantage 
of incumbency in legislative elections. In such examples, the collection of relevant data 
is essential, because without enough covariates to make the design approximately ignor- 
able, sensitivity of inferences to plausible missing data models can be so great that the 
observed data may provide essentially no information about the questions of interest. As 
illustrated in Figure 8.2, inferences can still be sensitive to the model even if ignorability 
is accepted, if there is little overlap in the distributions of covariates for units receiving 
different treatments. 

Data collection and organization methods for observational studies include matched 
sampling, subclassification, blocking, and stratification, all of which are methods of intro- 
ducing covariates x in a way to limit the sensitivity of inference to the specific form of the y 
given x models by limiting the range of x-space over which these models are being used for 
extrapolation. Specific techniques that arise naturally in Bayesian analyses include post- 
stratification and the analysis of covariance (regression adjustment). Under either of these 
approaches, inference is performed conditional on covariates and then averaged over the 
distribution of these covariates in the population, thereby correcting for differences between 
treatment groups. 

Two general difficulties arise when implementing this plan for analyzing observational 
studies: 


1. Being out of the experimenter’s control, the treatments can easily be unbalanced. Con- 
sider the educational testing example of Section 5.5 and suppose the data arose from 
observational studies rather than experiments. If, for example, good students received 
coaching and poor students received no coaching, then the inference in each school would 
be highly sensitive to model assumptions (for example, the assumption of an additive 
treatment effect, as illustrated in Figure 8.2). This difficulty alone can make a dataset 
useless in practice for answering questions of substantive interest. 


2. Typically, the actual treatment assignment in an observational study depends on several 
unknown and even possibly unmeasurable factors (for example, the state of mind of the 
student on the day of decision to enroll in a coaching program), and so inferences about 
θ are sensitive to assumptions about the nonignorable model for treatment assignment. 


Data gathering and analysis in observational studies for causal effects is a vast area. 
Our purpose in raising the topic here is to connect the general statistical ideas of ignorability, 
sensitivity, and using available information to the specific models of applied Bayesian 
statistics discussed in the other chapters of this book. For example, in Sections 8.3 and 8.4, 
we illustrated how the information inherent in stratification in a survey and blocking in an 
experiment can be used in Bayesian inference via hierarchical models. In general, the use of 
covariate information increases expected precision under specific models and, if well done, 
reduces sensitivity to alternative models. However, too much blocking can make modeling 
more difficult and sensitive. See Exercise 8.6 for more discussion of this point. 


                          Assignment,  Exposure,  Survival,  Number of units
Category                  I_obs,i      U_obs,i    Y_obs,i    in category
Complier or never-taker   0            0          0                 74
Complier or never-taker   0            0          1              11514
Never-taker               1            0          0                 34
Never-taker               1            0          1               2385
Complier                  1            1          0                 12
Complier                  1            1          1               9663


Table 8.5 Summary statistics from an experiment on vitamin A supplements, where the vitamin 
was available (but optional) only to those assigned the treatment. The table shows number of units 
in each assignment/exposure/outcome condition. From Sommer and Zeger (1991). 


Causal inference and principal stratification 


Principal stratification refers to a method of adjusting for an outcome variable that is 
‘intermediate to’ or ‘on the causal pathway’ to the final outcome y. Suppose we call this 
intermediate outcome C, with corresponding potential outcomes C(1) and C(0), respectively, if an 
individual is assigned to treatment condition 1 or 0. If I is an indicator variable for 
assignment to condition 1, then the observed value is C_obs, where C_obs,i = I_i C_i(1) + (1 − I_i) C_i(0). A 
common mistake is to treat C_obs as if it were a covariate, which it is not, unless C(1) = C(0), 
and do an analysis stratified by C_obs. The correct procedure is to stratify on the joint 
values (C(1), C(0)), which are unaffected by assignment I and so can be treated as a vector 
covariate. Thus, stratifying by C(1), C(0) is legitimate and is called ‘principal stratification.’ 

There are many examples of this general principle of stratifying on intermediate out- 
comes, the most common being compliance with assigned treatment. This is an important 
topic for a few reasons. First, it is a bridge to the economists’ tool of instrumental variables, 
as we shall discuss. Second, randomized experiments with noncompliance can be viewed as 
‘islands’ between the ‘shores’ of perfect randomized experiments and purely observational 
studies. Third, noncompliance is an important introduction to more complex examples of 
principal stratification. 


Example. A randomized experiment with noncompliance 

A large randomized experiment assessing the effect of vitamin A supplements on infant 
mortality was conducted in Indonesian villages, where vitamin A was only available to 
those assigned to take it. In the context of this example I is an indicator for assignment 
to the vitamin A treatment. The intermediate outcome in this case is compliance. 
There are two principal strata here, defined by whether or not the units would take 
vitamin A if assigned it: compliers and noncompliers. The strata are observed for the 
units assigned to take vitamin A because we know whether they comply or not, but 
the strata are not observed for the units assigned to control because we do not know 
what they would have done had they been assigned to the vitamin A group. In terms 
of the notation above, we know C_i(0) = 0 for everyone, indicating no one would take 
vitamin A when assigned not to take it (they have no way of getting it), and know 
C_i(1) only for those units assigned to take vitamin A. The data are summarized in 
Table 8.5. 


This electronic edition is for non-commercial purposes only. 


224 8. MODELING ACCOUNTING FOR DATA COLLECTION 
Complier average causal effects and instrumental variables 


In a randomized experiment with noncompliance, such as the Indonesian vitamin A exper- 
iment just described, the objective is to estimate the causal effect of the treatment within 
each principal stratum (see Exercise 8.16). We shall use that setting to describe the in- 
strumental variable approach (popular in econometrics) for estimating causal effects within 
each stratum and then compare it to the Bayesian likelihood approach. The average causal 
effects for compliers and noncompliers are called the complier average causal effect (CACE) 
and the noncomplier average causal effect (NACE), respectively. The overall average causal 
effect of being assigned to the treatment (averaged over compliance status) is 


$$\bar{Y}_1 - \bar{Y}_0 = p_c \cdot \text{CACE} + (1 - p_c) \cdot \text{NACE}, \qquad (8.18)$$


where p_c is the proportion of compliers in the population. Expression (8.18) is known as 
the intention-to-treat effect because it measures the effect over the entire population that 
we intend to treat (including those who do not comply with the treatment assigned and 
therefore do not reap its potential benefits). 

In the case of a randomized experiment, we can estimate the intention-to-treat effect 
Ȳ_1 − Ȳ_0 with the usual estimate, ȳ_1 − ȳ_0. It is also straightforward to estimate the proportion 
of compliers p_c in the population, with the estimate being the proportion of compliers in 
the random half assigned to take vitamin A. We would like to estimate CACE—the effect 
of the treatment for those who actually would take it if assigned. 

Suppose we assume that there is no effect on mortality of being assigned to take vitamin 
A for those who would not take it even when assigned to take it. This is known as the 
exclusion restriction because it excludes the causal effect of treatment assignment for non- 
compliers. This assumption means that NACE = 0, and then a simple estimate of CACE 
is 

$$\widehat{\text{CACE}} = (\bar{y}_1 - \bar{y}_0) / \hat{p}_c, \qquad (8.19)$$


which is called the instrumental variables estimate in economics. The instrumental variables 
estimate for CACE is thus the estimated intention-to-treat effect on Y, divided by the 
proportion of the treatment group who are compliers, that is, who actually receive the new 
treatment. 
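
As a worked illustration, the following sketch applies the intention-to-treat estimate and the instrumental variables estimate (8.19) to the counts in Table 8.5, taking survival as the outcome. The dictionary layout and helper function are ours; only the counts come from the table.

```python
# Counts from Table 8.5, keyed by (assignment, exposure, survival)
counts = {
    (0, 0, 0): 74, (0, 0, 1): 11514,
    (1, 0, 0): 34, (1, 0, 1): 2385,
    (1, 1, 0): 12, (1, 1, 1): 9663,
}

def total(assignment, exposure=None, survival=None):
    """Number of units with the given assignment and optional exposure/survival."""
    return sum(
        n for (i, u, y), n in counts.items()
        if i == assignment
        and (exposure is None or u == exposure)
        and (survival is None or y == survival)
    )

ybar_1 = total(1, survival=1) / total(1)    # survival rate, assigned vitamin A
ybar_0 = total(0, survival=1) / total(0)    # survival rate, assigned control
itt = ybar_1 - ybar_0                       # intention-to-treat estimate
p_c_hat = total(1, exposure=1) / total(1)   # estimated proportion of compliers
cace_hat = itt / p_c_hat                    # instrumental variables estimate (8.19)

print(f"ITT = {itt:.5f}, p_c = {p_c_hat:.3f}, CACE = {cace_hat:.5f}")
```

With these counts the estimated intention-to-treat effect on survival is about 0.0026, the estimated proportion of compliers is about 0.80, and the resulting estimate of CACE is about 0.0032, or roughly three additional survivors per thousand compliers.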


Bayesian causal inference with noncompliance 


The instrumental variables approach to noncompliance is effective at revealing how simple 
assumptions can be used to address noncompliance. The associated estimate (8.19) is 
a ‘method of moments’ estimate, however, which is generally far from satisfactory. The 
Bayesian approach is to treat the unknown compliance status for each person in the control 
group explicitly as missing data. A particular advantage of the Bayesian approach is the 
freedom to relax the exclusion restriction. A more complex example with noncompliance 
involves a study where encouragement to receive influenza shots is randomly assigned, 
but many patients do not comply with their encouragements, thereby creating two kinds of 
noncompliers (those who are encouraged to receive a shot but do not, and those who are not 
encouraged to receive a shot but do so anyway) in addition to compliers. The computations 
are easily done using iterative simulation methods of the sort discussed in Part III. 


8.7 Censoring and truncation 


We illustrate a variety of possible missing data mechanisms by considering a series of vari- 
ations on a simple example. In all these variations, it is possible to state the appropriate 
model directly—although as examples become more complicated, it is useful and ultimately 
necessary to work within a formal structure for modeling data collection. 

The example involves observation of a portion of a ‘complete’ dataset consisting of N weighings 
y_i, i = 1, ..., N, of an object with unknown weight θ, which are assumed to follow a 
N(θ, 1) distribution, with a noninformative uniform prior distribution on θ. Initially N is 
fixed at 100. In each variation, a different data-collection rule is followed, and in each case 
only n = 91 of the original N measurements are observed. We label ȳ_obs as the mean of the 
observed measurements. 

For each case, we first describe the data collection and then present the Bayesian analysis. 
This example shows how a fixed dataset can have different inferential implications depending 
on how it was collected. 


1. Data missing completely at random 


Suppose we weigh an object 100 times on an electronic scale with a known N(θ, 1) measurement 
distribution, where θ is the true weight of the object. Randomly, with probability 
0.1, the scale fails to report a value, and we observe 91 values. Then the complete data y 
are N(θ, 1) subject to Bernoulli sampling with known probability of selection of 0.9. Even 
though the sample size n = 91 is binomially distributed under the model, the posterior 
distribution of θ is the same as if the sample size of 91 had been fixed in advance. 

The inclusion model is I_i ~ Bernoulli(0.9), independent of y, and the posterior distribution 
is, 

$$p(\theta \mid y_{\rm obs}, I) = p(\theta \mid y_{\rm obs}) = {\rm N}(\theta \mid \bar{y}_{\rm obs}, 1/91).$$


2. Data missing completely at random with unknown probability of missingness 


Consider the same situation with 91 observed values and 9 missing values, except that the 
probability that the scale randomly fails to report a weight is unknown. Now the complete 
data are N(θ, 1) subject to Bernoulli sampling with unknown probability of selection, π. 
The inclusion model is then I_i|π ~ Bernoulli(π), independent of y. 

The posterior distribution of θ is the same as in variation 1 only if θ and π are independent 
in their prior distribution, that is, are ‘distinct’ parameters. If θ and π are dependent, then 
n = 91, the number of reported values, provides extra information about θ beyond the 91 
measured weights. For example, if it is known that π = θ/(1 + θ), then n/N = 91/100 can be used to 
estimate π, and thereby θ = π/(1 − π), even if the measurements y were not recorded. 

Formally, the posterior distribution is, 

$$
\begin{aligned}
p(\theta, \pi \mid y_{\rm obs}, I) &\propto p(\theta, \pi)\, p(y_{\rm obs}, I \mid \theta, \pi) \\
&\propto p(\theta, \pi)\, {\rm N}(\theta \mid \bar{y}_{\rm obs}, 1/91)\, {\rm Bin}(n \mid 100, \pi).
\end{aligned}
$$

This formula makes clear that if θ and π are independent in the prior distribution, then the 
posterior inference for θ is as above. If π = θ/(1 + θ), then the posterior distribution of θ is, 

$$p(\theta \mid y_{\rm obs}, I) \propto {\rm N}(\theta \mid \bar{y}_{\rm obs}, 1/91)\, {\rm Bin}(n \mid 100, \theta/(1 + \theta)).$$

Given n = 91 and ȳ_obs, this density can be calculated numerically over a range of θ, and 
then simulations of θ can be drawn using the inverse-cdf method. 
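
The grid calculation and inverse-cdf draws for the case π = θ/(1 + θ) can be sketched as follows. The observed mean ȳ_obs = 10.0, the grid limits and spacing, and the random seed are hypothetical choices made only so the example runs; none of them comes from the text.

```python
import bisect
import math
import random

N_TOTAL, n_obs = 100, 91
ybar_obs = 10.0   # hypothetical observed mean; not given in the text

def log_post(theta):
    """Unnormalized log of N(theta | ybar_obs, 1/91) * Bin(n | 100, theta/(1+theta))."""
    pi = theta / (1.0 + theta)
    log_normal = -0.5 * n_obs * (theta - ybar_obs) ** 2
    log_binom = n_obs * math.log(pi) + (N_TOTAL - n_obs) * math.log(1.0 - pi)
    return log_normal + log_binom

# Evenly spaced grid covering roughly 6 posterior sd on each side of ybar_obs
grid = [ybar_obs - 0.6 + 1.2 * k / 2000 for k in range(2001)]
logp = [log_post(t) for t in grid]
peak = max(logp)
weights = [math.exp(lp - peak) for lp in logp]   # subtract the max for stability

# Discrete approximation to the posterior cdf on the grid
cdf, running = [], 0.0
for w in weights:
    running += w
    cdf.append(running)
cdf = [c / running for c in cdf]

def draw(rng):
    """Inverse-cdf draw: first grid point whose cdf value reaches a uniform draw."""
    return grid[bisect.bisect_left(cdf, rng.random())]

rng = random.Random(0)
draws = [draw(rng) for _ in range(5000)]
post_mean = sum(draws) / len(draws)
print(round(post_mean, 2))
```

For this hypothetical ȳ_obs the binomial factor is nearly flat over the range where the normal factor has appreciable mass, so the draws stay close to N(ȳ_obs, 1/91); the extra information in n matters more when the measurements themselves are less informative.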


3. Censored data 


Now modify the scale so that all weights produce a report, but the scale has an upper limit 
of 200 kg for reports: all values above 200 kg are reported as ‘too heavy.’ The complete 
data are still N(θ, 1), but the observed data are censored; if we observe ‘too heavy,’ we know 
that it corresponds to a weighing with a reading above 200. Now, for the same 91 observed 
weights and 9 ‘too heavy’ measurements, the posterior distribution of θ differs from that 
of variation 2 of this example. In this case, the contributions to the likelihood from the 91 
numerical measurements are normal densities, and the contributions from the 9 ‘too heavy’ 
measurements are of the form Φ(θ − 200), where Φ is the normal cumulative distribution 
function. 

The inclusion model is Pr(I_i = 1|y_i) = 1 if y_i ≤ 200, and 0 otherwise. The posterior 
distribution is, 

$$
\begin{aligned}
p(\theta \mid y_{\rm obs}, I) &\propto p(\theta)\, p(y_{\rm obs}, I \mid \theta) \\
&= p(\theta) \int p(y_{\rm obs}, y_{\rm mis}, I \mid \theta)\, dy_{\rm mis} \\
&= p(\theta) \prod_{i=1}^{91} {\rm N}(y_{{\rm obs},i} \mid \theta, 1) \prod_{i=1}^{9} \Phi(\theta - 200) \\
&\propto {\rm N}(\theta \mid \bar{y}_{\rm obs}, 1/91)\, [\Phi(\theta - 200)]^{9}.
\end{aligned}
$$

Given ȳ_obs, this density can be calculated numerically over a range of θ, and then simulations 
of θ can be drawn using the inverse-cdf method. 
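
To show what the nine censored reports contribute, the sketch below evaluates the censored-data posterior density on a grid and compares its mean with the mean obtained when the factor [Φ(θ − 200)]^9 is dropped. The observed mean ȳ_obs = 198.5 and the grid are hypothetical choices for illustration.

```python
import math

n_obs, n_cens, limit = 91, 9, 200.0
ybar_obs = 198.5   # hypothetical observed mean; not given in the text

def log_Phi(z):
    """Log of the standard normal cdf."""
    return math.log(0.5 * math.erfc(-z / math.sqrt(2.0)))

def log_post(theta, censored=True):
    """Unnormalized log posterior density under a uniform prior on theta."""
    lp = -0.5 * n_obs * (theta - ybar_obs) ** 2
    if censored:
        lp += n_cens * log_Phi(theta - limit)   # the 9 'too heavy' reports
    return lp

grid = [ybar_obs - 1.0 + 2.0 * k / 4000 for k in range(4001)]

def grid_mean(censored):
    """Posterior mean of theta approximated on the grid."""
    logp = [log_post(t, censored) for t in grid]
    peak = max(logp)
    w = [math.exp(lp - peak) for lp in logp]
    return sum(t * wi for t, wi in zip(grid, w)) / sum(w)

mean_ignoring = grid_mean(censored=False)   # treats the 9 reports as if absent
mean_censored = grid_mean(censored=True)

print(round(mean_ignoring, 3), round(mean_censored, 3))
```

Because Φ(θ − 200) increases with θ, the censored reports pull the posterior distribution upward from ȳ_obs; dropping them understates the weight.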


4. Censored data with unknown censoring point 


Now extend the experiment by allowing the censoring point to be unknown. Thus the 
complete data are distributed as N(θ, 1), but the observed data are censored at an unknown 
φ, rather than at 200 as in the previous variation. Now the posterior distribution of θ 
differs from that of the previous variation because the contributions from the 9 ‘too heavy’ 
measurements are of the form Φ(θ − φ). Even when θ and φ are a priori independent, 
these 9 contributions to the likelihood create dependence between θ and φ in the posterior 
distribution, and so to find the posterior distribution of θ, we must consider the joint 
posterior distribution p(θ, φ). 

The posterior distribution is then p(θ, φ|y_obs, I) ∝ p(φ|θ) N(θ|ȳ_obs, 1/91) [Φ(θ − φ)]^9. 
Given ȳ_obs, this density can be calculated numerically over a grid of (θ, φ), and then 
simulations of (θ, φ) can be drawn using the grid method (as in the example in Figure 3.3 on 
page 76). 

We can formally derive the censored-data model using the observed and missing-data 
notation, as follows. Label y = (y_1, ..., y_N) as the original N = 100 uncensored weighings: 
the complete data. The observed information consists of the n = 91 observed values, 
y_obs = (y_obs,1, ..., y_obs,91), and the inclusion vector, I = (I_1, ..., I_100), which is composed 
of 91 ones and 9 zeros. There are no covariates, x. 

The complete-data likelihood in this example is 

$$p(y \mid \theta) = \prod_{i=1}^{100} {\rm N}(y_i \mid \theta, 1),$$

and the likelihood of the inclusion vector, given the complete data, has a simple independent, 
identically distributed form: 

$$
p(I \mid y, \phi) = \prod_{i=1}^{100} p(I_i \mid y_i, \phi)
= \prod_{i=1}^{100}
\begin{cases}
1 & \text{if } (I_i = 1 \text{ and } y_i \le \phi) \text{ or } (I_i = 0 \text{ and } y_i > \phi) \\
0 & \text{otherwise.}
\end{cases}
$$


For valid Bayesian inference we must condition on all observed data, which means we need 
the joint likelihood of y_obs and I, which we obtain mathematically by integrating out y_mis 
from the complete-data likelihood: 

$$
\begin{aligned}
p(y_{\rm obs}, I \mid \theta, \phi) &= \int p(y, I \mid \theta, \phi)\, dy_{\rm mis} \\
&= \int p(y \mid \theta, \phi)\, p(I \mid y, \theta, \phi)\, dy_{\rm mis} \\
&= \prod_{i:\, I_i = 1} {\rm N}(y_i \mid \theta, 1) \prod_{i:\, I_i = 0} \int {\rm N}(y_i \mid \theta, 1)\, p(I_i \mid y_i, \phi)\, dy_i \\
&= \prod_{i:\, I_i = 1} {\rm N}(y_i \mid \theta, 1) \prod_{i:\, I_i = 0} \Phi(\theta - \phi) \\
&= \left[ \prod_{i=1}^{91} {\rm N}(y_{{\rm obs},i} \mid \theta, 1) \right] [\Phi(\theta - \phi)]^{9}. \qquad (8.20)
\end{aligned}
$$

Since the joint posterior distribution p(θ, φ|y_obs, I) is proportional to the joint prior 
distribution of (θ, φ) multiplied by the likelihood (8.20), we have provided an algebraic illustration 
of a case where the unknown φ cannot be ignored in making inferences about θ. 

Thus we can see that the missing data mechanism is nonignorable. The likelihood of 
the observed measurements, if we (mistakenly) ignore the observation indicators, is 


$$p(y_{\rm obs} \mid \theta) = \prod_{i=1}^{91} {\rm N}(y_{{\rm obs},i} \mid \theta, 1),$$
which is wrong—crucially different from the appropriate likelihood (8.20)—because it omits 
the factors corresponding to the censored observations. 


5. Truncated data 


Now suppose the object is weighed by someone else who only provides to you the 91 observed 
values, but not the number of times the object was weighed. Also, suppose, as in our first 
censoring example above, that we know that no values over 200 are reported by the scale. 
The complete data can still be viewed as N(θ, 1), but the observed data are truncated at 
200. The likelihood of each observed data point in the truncated distribution is a normal 
density divided by a normalizing factor of Φ(200 − θ). We can proceed by working with 
this observed data likelihood, but we first demonstrate the connection between censoring 
and truncation. Truncated data differ from censored data in that no count of observations 
beyond the truncation point is available. With censoring, the values of observations beyond 
the truncation point are lost but their number is observed. 
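Now that N is unknown, the joint posterior distribution is, 

$$
\begin{aligned}
p(\theta, N \mid y_{\rm obs}, I) &\propto p(\theta, N)\, p(y_{\rm obs}, I \mid \theta, N) \\
&\propto p(\theta, N) \binom{N}{91} {\rm N}(\theta \mid \bar{y}_{\rm obs}, 1/91)\, [\Phi(\theta - 200)]^{N-91}.
\end{aligned}
$$

So the marginal posterior distribution of θ is, 

$$p(\theta \mid y_{\rm obs}, I) \propto \sum_{N=91}^{\infty} p(\theta, N) \binom{N}{91} {\rm N}(\theta \mid \bar{y}_{\rm obs}, 1/91)\, [\Phi(\theta - 200)]^{N-91}.$$

If p(θ, N) ∝ 1/N, then this becomes, 

$$
\begin{aligned}
p(\theta \mid y_{\rm obs}, I) &\propto {\rm N}(\theta \mid \bar{y}_{\rm obs}, 1/91) \sum_{N=91}^{\infty} \frac{1}{N} \binom{N}{91} [\Phi(\theta - 200)]^{N-91} \\
&\propto {\rm N}(\theta \mid \bar{y}_{\rm obs}, 1/91) \sum_{N=91}^{\infty} \binom{N-1}{90} [\Phi(\theta - 200)]^{N-91} \\
&= {\rm N}(\theta \mid \bar{y}_{\rm obs}, 1/91)\, [1 - \Phi(\theta - 200)]^{-91},
\end{aligned}
$$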


where the last line can be derived because the summation has the form of a negative binomial 
density over N − 91 with, in the notation of Appendix A, α = 91 and 1/(β + 1) = Φ(θ − 200). 
Thus there are two ways to end up with the posterior distribution for θ in this case, by using 
the truncated likelihood or by viewing this as a case of censoring with p(N) ∝ 1/N. 
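
The summation step can be checked numerically. The sketch below compares a long partial sum of the series with the closed form [1 − Φ(θ − 200)]^(−91), and with the normalizing factor Φ(200 − θ)^(−91) of the truncated likelihood, at one hypothetical value of θ; the value θ = 198.7 and the number of terms are arbitrary choices for illustration.

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def series(q, n=91, terms=400):
    """Partial sum over N = n, ..., n + terms - 1 of C(N-1, n-1) * q**(N-n)."""
    return sum(math.comb(n - 1 + k, n - 1) * q**k for k in range(terms))

theta = 198.7             # hypothetical value of theta, for illustration only
q = Phi(theta - 200.0)    # probability that a single weighing exceeds 200

partial = series(q)                          # censoring view, summed over N
closed = (1.0 - q) ** (-91)                  # closed form of the infinite sum
truncated = Phi(200.0 - theta) ** (-91)      # truncated-likelihood normalization

print(partial, closed, truncated)
```

The three quantities agree to rounding error, which is the numerical counterpart of the statement that the truncated likelihood and the censoring model with p(N) ∝ 1/N lead to the same posterior distribution for θ.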


6. Truncated data with unknown truncation point 


Finally, extend the variations to allow an unknown truncation point; that is, the complete 
data are N(θ, 1), but the observed data are truncated at an unknown value φ. Here the 
posterior distribution of θ is a mixture of posterior distributions with known truncation points 
(from the previous variation), averaged over the posterior distribution of the truncation 
point, φ. This posterior distribution differs from the analogous one for censored data in 
variation 4. With censored data and an unknown censoring point, the proportion of values 
that are observed provides relatively powerful information about the censoring point φ (in 
units of standard deviations from the mean), but this source of information is absent with 
truncated data. 
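Now the joint posterior distribution is

$$
p(\theta, \phi, N \mid y_{\rm obs}, I) \propto p(\theta, \phi, N) \binom{N}{91} {\rm N}(\theta \mid \bar{y}_{\rm obs}, 1/91)\, [\Phi(\theta - \phi)]^{N-91}.
$$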


Once again, we can sum over N to get a marginal density of (θ, φ). If, as before, 
p(N | θ, φ) ∝ 1/N, then 

$$
p(\theta, \phi \mid y_{\rm obs}, I) \propto p(\theta, \phi)\, {\rm N}(\theta \mid \bar{y}_{\rm obs}, 1/91)\, [1 - \Phi(\theta - \phi)]^{-91}.
$$


With a noninformative prior density, this joint posterior density actually implies an 
improper posterior distribution for φ, because as φ → ∞, the factor 
$[1 - \Phi(\theta - \phi)]^{-91}$ approaches 1. In this case, the marginal posterior density 
for θ is just 

$$
p(\theta \mid y_{\rm obs}) = {\rm N}(\theta \mid \bar{y}_{\rm obs}, 1/91),
$$


as in the first variation of this example. 
We could continue to exhibit more and more complex variations in which the data 
collection mechanism influences the posterior distribution. 
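
To see why the posterior distribution for the truncation point is improper under a flat prior, it can help to evaluate the factor [1 − Φ(θ − φ)]^(−91) as φ moves away from θ. The sketch below is only an illustration; the value θ = 150 and the function names are assumptions chosen for the example.

```python
import math

def std_normal_cdf(z):
    """Standard normal cdf, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def truncation_point_factor(theta, phi, n_obs=91):
    """[1 - Phi(theta - phi)]**(-n_obs), computed as Phi(phi - theta)**(-n_obs)."""
    return std_normal_cdf(phi - theta) ** (-n_obs)

theta = 150.0                          # hypothetical value of the mean
offsets = [0.0, 2.0, 5.0, 10.0]        # phi - theta, in standard deviations
factors = [truncation_point_factor(theta, theta + d) for d in offsets]
for d, f in zip(offsets, factors):
    print(f"phi = theta + {d:4.1f}: factor = {f:.6g}")
```

The factor decreases toward 1 and then stays flat, so with a uniform prior the density in φ does not decay for large φ and cannot be normalized.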


More complicated patterns of missing data 


Incomplete data can be observed in other forms too, such as rounded or binned data (for 
example, heights rounded to the nearest inch, ages rounded down to the nearest integer, 
or income reported in discrete categories); see Exercise 3.5. With categorical data, it can 
happen that one knows which of a set of categories a data point belongs to, but not the exact 
category. For example, a survey respondent might report being Christian without specifying 
Protestant, Catholic, or other. Section 18.6 illustrates patterns of missing categorical data 
induced by nonresponse in a survey. 

For more complicated missing data patterns, one must generalize the notation of Section 
8.2 to allow for partial information such as censoring points, rounding, or data observed 
in coarse categories. The general Bayesian approach still holds, but now the observation 
indicators $I_i$ are not simply 0’s and 1’s but more generally indicate to which set in the 
sample space $y_i$ can belong. 
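
As one concrete case, consider rounded data. When only the set containing each observation is known, the likelihood contribution of that observation is the probability of the set under the model. The following is a minimal sketch assuming a normal model with mean `mu` and standard deviation `sigma` and rounding to the nearest unit; the function names and the example values are hypothetical and are not taken from the text.

```python
import math

def normal_cdf(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def interval_prob(lower, upper, mu, sigma):
    """Pr(lower < y <= upper) when y ~ N(mu, sigma^2)."""
    return normal_cdf((upper - mu) / sigma) - normal_cdf((lower - mu) / sigma)

def rounded_loglik(rounded_values, mu, sigma, width=1.0):
    """Log-likelihood when each y_i is known only to lie within width/2 of its rounded value."""
    half = width / 2.0
    return sum(
        math.log(interval_prob(r - half, r + half, mu, sigma))
        for r in rounded_values
    )

# Hypothetical heights rounded to the nearest inch
heights = [69, 70, 71]
print(rounded_loglik(heights, mu=70.0, sigma=3.0))
print(rounded_loglik(heights, mu=75.0, sigma=3.0))
```

Censoring fits the same pattern with a one-sided interval: an observation known only to exceed a censoring point contributes the upper-tail probability beyond that point.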
8.8 Discussion 


In general, the method of data collection dictates the minimal level of modeling required for 
a valid Bayesian analysis, that is, conditioning on all information used in the design—for 
example, conditioning on strata and clusters in a sample survey or blocks in an experiment. 
A Bayesian analysis that is conditional on enough information can ignore the data collection 
mechanism for inference although not necessarily for model checking. As long as data have 
been recorded on all the variables that need to be included in the model—whether for 
scientific modeling purposes or because they are used for data collection—one can proceed 
with the methods of modeling and inference discussed in the other chapters of this book, 
notably using regression models for $p(y_{\rm obs} \mid x, \theta)$ as in the models of Parts IV and V. As 
usual, the greatest practical advantages of the Bayesian approach come from accounting for 
uncertainty in a multiparameter setting, and hierarchical modeling of the data-collection 
process and of the underlying process under study. 


8.9 Bibliographic note 


The material in this chapter on the role of study design in Bayesian data analysis devel- 
ops from a sequence of contributions on Bayesian inference with missing data, where even 
sample surveys and studies for causal effects are viewed as problems of missing data. The 
general Bayesian perspective was first presented in Rubin (1976), which defined the concepts 
of ignorability, missing at random, and distinctness of parameters; related work appears in 
Dawid and Dickey (1977). The notation of potential outcomes with fixed unknown values in 
randomized experiments dates back to Neyman (1923) and is standard in that context (see 
references in Speed, 1990, and Rubin, 1990); this idea was introduced for causal inference 
in observational studies by Rubin (1974b). More generally, Rubin (1978a) applied the per- 
spective to Bayesian inference for causal effects, where treatment assignment mechanisms 
were treated as missing-data mechanisms. Dawid (2000) and the accompanying discussions 
present a variety of Bayesian and related perspectives on causal inference; see also Green- 
land, Robins, and Pearl (1999), Robins (1998), Rotnitzky, Robins, and Scharfstein (1999), 
and Pearl (2010). 

David et al. (1986) examined the reasonableness of the missing-at-random assumption for 
a problem in missing data imputation. The stability assumption in experiments was defined 
in Rubin (1980a), further discussed in the second chapter of Rubin (1987a), which explicates 
this approach to survey sampling, and extended in Rubin (1990). Smith (1983) discusses 
the role of randomization for Bayesian and non-Bayesian inference in survey sampling. 
Work on Bayesian inference before Rubin (1976) did not explicitly consider models for 
the data collection process, but rather developed the analysis directly from assumptions of 
exchangeability; see, for example, Ericson (1969) and Scott and Smith (1973) for sample 
surveys and Lindley and Novick (1981) for experiments. Rosenbaum and Rubin (1983a) 
introduced the expression ‘strongly ignorable.’ 

The problem of Bayesian inference for data collected under sequential designs has been 
the subject of much theoretical study and debate, for example, Barnard (1949), Anscombe 
(1963), Edwards, Lindman, and Savage (1963), and Pratt (1965). Berger (1985, Chapter 
7), provides an extended discussion from the perspective of decision theory. Rosenbaum 
and Rubin (1984a) and Rubin (1984) discuss the relation between sequential designs and 
robustness to model uncertainty. The book by Berry et al. (2010) offers practical guidance 
for Bayesian analyses in sequential designs. 

There is a vast statistical literature on the general problems of missing data, surveys, 
experiments, observational studies, and censoring and truncation discussed in this chapter. 
We present a few of the recent references that apply modern Bayesian models to various 
data collection mechanisms. Little and Rubin (2002) present many techniques and relevant 
theory for handling missing data, some of which we review in Chapter 18. Heitjan and Rubin 
(1990, 1991) generalize missing data to coarse data, which includes rounding and heaping; 
previous work on rounding is reviewed by Heitjan (1989). Analysis of record-breaking data, 
a kind of time-varying censoring, is discussed by Carlin and Gelfand (1993). 

Rosenbaum (2010) thoroughly reviews observational studies from a non-Bayesian per- 
spective. Rosenbaum and Rubin (1983b) present a study of sensitivity to nonignorable 
models for observational studies. Heckman (1979) is an influential work on nonignorable 
models from an econometric perspective. Greenland (2005) discusses models for biases in 
observational studies. 

Hierarchical models for stratified and cluster sampling are discussed by Scott and Smith 
(1969), Little (1991, 1993), and Nandaram and Sedransk (1993); related material also ap- 
pears in Skinner, Holt, and Smith (1989). Gelman and Carlin (2001), Gelman (2007a) and 
Schutt (2009) discuss connections between hierarchical models and survey weights. The 
survey of Australian schoolchildren in Section 8.3 is described in Carlin et al. (1997). 

The introduction to Goldstein and Silver (1989) discusses the role of designs of surveys 
and experiments in gathering data for the purpose of estimating hierarchical models. Hier- 
archical models for experimental data are also discussed by Tiao and Box (1967) and Box 
and Tiao (1973). 

Rubin (1978a) and Kadane and Seidenfeld (1990) discuss randomization from two dif- 
ferent Bayesian perspectives; see also Senn (2013). Rubin (1977) discusses the analysis of 
designs that are ignorable conditional on a covariate. Rosenbaum and Rubin (1983a) intro- 
duce the idea of propensity scores, which can be minimally adequate summaries as defined 
by Rubin (1985); technical applications of propensity scores to experiments and observa- 
tional studies appear in Rosenbaum and Rubin (1984b, 1985). Later work on propensity 
score methods includes Rubin and Thomas (1992, 2000) and Imbens (2000). There is now 
a relatively vast applied literature on propensity scores; see, for example, Connors et al. 
(1996). 

Frangakis and Rubin (2002) introduced the expression ‘principal stratification’ and dis- 
cuss its application to sample outcomes. Rubin (1998, 2000) presents the principal strati- 
fication approach to the problem of ‘censoring due to death,’ and Zhang (2002) develops a 
Bayesian attack on the problem. The vitamin A experiment, along with a general discus- 
sion of the connection between principal stratification and instrumental variables, appears 
in Imbens and Rubin (1997). Recent applications of these ideas in medicine and public 
policy include Dehejia and Wahba (1999), Hirano et al. (2000) and Barnard et al. (2003). 
Imbens and Angrist (1994) give a non-Bayesian presentation of instrumental variables for 
causal inference; see also McClellan, McNeil, and Newhouse (1994) and Newhouse and Mc- 
Clellan (1994), as well as Bloom (1984) and Zelen (1979). Glickman and Normand (2000) 
connect principal stratification to continuous instrumental variables models. 


8.10 Exercises 


1. Definition of concepts: the concepts of randomization, exchangeability, and ignorability 
have often been confused in the statistical literature. For each of the following statements, 
explain why it is false but also explain why it has a kernel of truth. Illustrate with 
examples from this chapter or earlier chapters. 


(a) Randomization implies exchangeability: that is, if a randomized design is used, an 
exchangeable model is appropriate for the observed data, $y_{{\rm obs}\,1}, \ldots, y_{{\rm obs}\,n}$. 

(b) Randomization is required for exchangeability: that is, an exchangeable model for 
$y_{{\rm obs}\,1}, \ldots, y_{{\rm obs}\,n}$ is appropriate only for data that were collected in a 
randomized fashion. 

(c) Randomization implies ignorability; that is, if a randomized design is used, then it is 
ignorable. 

(d) Randomization is required for ignorability; that is, randomized designs are the only 
designs that are ignorable. 

(e) Ignorability implies exchangeability; that is, if an ignorable design is used, then an 
exchangeable model is appropriate for the observed data, $y_{{\rm obs}\,1}, \ldots, y_{{\rm obs}\,n}$. 

(f) Ignorability is required for exchangeability; that is, an exchangeable model for the 
vector $y_{{\rm obs}\,1}, \ldots, y_{{\rm obs}\,n}$ is appropriate only for data that were collected 
using an ignorable design. 


               Treatment 
Block      A     B     C     D 
  1       89    88    97    94 
  2       84    77    92    79 
  3       81    87    87    85 
  4       87    92    89    84 
  5       79    81    80    88 

Table 8.6 Yields of penicillin produced by four manufacturing processes (treatments), each applied 
in five different conditions (blocks). Four runs were made within each block, with the treatments 
assigned to the runs at random. From Box, Hunter, and Hunter (1978), who adjusted the data so 
that the averages are integers, a complication we ignore in our analysis. 


2. Application of design issues: choose an example from earlier in this book and discuss the 
relevance of the material in the current chapter to the analysis. In what way would you 
change the analysis, if at all, given what you have learned from the current chapter? 


3. Distinct parameters and ignorability: 


(a) For the milk production experiment in Section 8.4, give an argument for why the 
parameters φ and θ may not be distinct. 

(b) If the parameters are not distinct in this example, the design is no longer ignorable. 
Discuss how posterior inferences would be affected by using an appropriate nonignor- 
able model. (You need not set up the model; just discuss the direction and magnitude 
of the changes in the posterior inferences for the treatment effect.) 


4. Interaction between units: consider a hypothetical agricultural experiment in which each 
of two fertilizers is assigned to 10 plots chosen completely at random from a linear array 
of 20 plots, and the outcome is the average yield of the crops in each plot. Suppose 
there is interference between units, because each fertilizer leaches somewhat onto the 
two neighboring plots. 


(a) Set up a model of potential data y, observed data $y_{\rm obs}$, and inclusion indicators I. 
The potential data structure will have to be larger than a 20 × 4 matrix in order to 
account for the interference. 

(b) Is the treatment assignment ignorable under this notation? 

(c) Suppose the estimand of interest is the average difference in yields under the two 
treatments. Define the finite-population estimand mathematically in terms of y. 

(d) Set up a probability model for y. 


5. Analyzing a designed experiment: Table 8.6 displays the results of a randomized blocks 
experiment on penicillin production. 
(a) Express this experiment in the general notation of this chapter, specifying x, $y_{\rm obs}$, 
$y_{\rm mis}$, N, and I. Sketch the table of units by measurements. How many observed 
measurements and how many unobserved measurements are there in this problem? 
(b) Under the randomized blocks design, what is the distribution of I? Is it ignorable? Is 
it known? Is it strongly ignorable? Are the propensity scores an adequate summary? 

(c) Set up a normal-based model of the data and all relevant parameters that is conditional 
on enough information for the design to be ignorable. 

(d) Suppose one is interested in the (superpopulation) average yields of penicillin, av- 
eraging over the block conditions, under each of the four treatments. Express this 
estimand in terms of the parameters in your model. 

We return to this example in Exercise 15.2. 
6. Including additional information beyond the adequate summary: 

(a) Suppose that the experiment in the previous exercise had been performed by complete 
randomization (with each treatment coincidentally appearing once in each block), not 
randomized blocks. Explain why the appropriate Bayesian modeling and posterior 
inference would not change. 

(b) Describe how the posterior predictive check would change under the assumption of 
complete randomization. 

(c) Why is the randomized blocks design preferable to complete randomization in this 
problem? 

(d) Give an example illustrating why too much blocking can make modeling more difficult 
and sensitive to assumptions. 

7. Simple random sampling: 


(a) Derive the exact posterior distribution for $\bar{y}$ under simple random sampling with the 
normal model and noninformative prior distribution. 

(b) Derive the asymptotic result (8.6). 

8. Finite-population inference for completely randomized experiments: 

(a) Derive the asymptotic result (8.17). 

(b) Derive the (finite-population) inference for $\bar{y}^A - \bar{y}^B$ under a model in which the pairs 
$(y_i^A, y_i^B)$ are drawn from a bivariate normal distribution with mean $(\mu^A, \mu^B)$, standard 
deviations $(\sigma^A, \sigma^B)$, and correlation ρ. 

(c) Discuss how inference in (b) depends on ρ and the implications in practice. Why does 
the dependence on ρ disappear in the limit of large N/n? 


9. Cluster sampling: 


(a) Discuss the analysis of one-stage and two-stage cluster sampling designs using the 
notation of this chapter. What is the role of hierarchical models in analysis of data 
gathered by one- and two-stage cluster sampling? 

(b) Discuss the analysis of cluster sampling in which the clusters were sampled with prob- 
ability proportional to some measure of size, where the measure of size is known for 
all clusters, sampled and unsampled. In what way do the measures of size enter into 
the Bayesian analysis? 

See Kish (1965) and Lohr (2009) for thoughtful presentations of classical methods for 
design and analysis of such data. 

10. Cluster sampling: Suppose data have been collected using cluster sampling, but the 
details of the sampling have been lost, so it is not known which units in the sample came 
from common clusters. 

(a) Explain why an exchangeable but not independent and identically distributed model 
is appropriate. 

(b) Suppose the clusters are of equal size, with A clusters, each of size B, and the data 
came from a simple random sample of a clusters, with a simple random sample of 
b units within each cluster. Under what limits of a, A, b, and B can we ignore the 
cluster sampling in the analysis? 


                        Number of phone lines 
Preference             1     2     3     4     ? 
Bush                 557    38     4     3     7 
Dukakis              427    27     1     0     3 
No opinion/other      87     1     0     0     7 

Table 8.7 Respondents to the CBS telephone survey classified by opinion and number of residential 
telephone lines (category ‘?’ indicates no response to the number of phone lines question). 


11. Capture-recapture (see Seber, 1992, and Barry et al., 2003): a statistician/fisherman is 
interested in N, the number of fish in a certain pond. He catches 100 fish, tags them, 
and throws them back. A few days later, he returns and catches fish until he has caught 
20 tagged fish, at which point he has also caught 70 untagged fish. (That is, the second 
sample has 20 tagged fish out of 90 total.) 


(a) Assuming that all fish are sampled independently and with equal probability, give the 
posterior distribution for N based on a noninformative prior distribution. (You can 
give the density in unnormalized form.) 


(b) Briefly discuss your prior distribution and also make sure your posterior distribution 
is proper. 

(c) Give the probability that the next fish caught by the fisherman is tagged. Write the 
result as a sum or integral—you do not need to evaluate it, but the result should not 
be a function of N. 


(d) The statistician/fisherman checks his second catch of fish and realizes that, of the 20 
‘tagged’ fish, 15 are definitely tagged, but the other 5 may be tagged—he is not sure. 
Include this aspect of missing data in your model and give the new joint posterior 
density for all parameters (in unnormalized form). 


12. Sampling with unequal probabilities: Table 8.7 summarizes the opinion poll discussed 
in the examples in Sections 3.4 and 8.3, with the responses classified by presidential 
preference and number of telephone lines in the household. We shall analyze these data 
assuming that the probability of reaching a household is proportional to the number of 
telephone lines. Pretend that the responding households are a simple random sample of 
telephone numbers; that is, ignore the stratification discussed in Section 8.3 and ignore 
all nonresponse issues. 

(a) Set up parametric models for (i) preference given number of telephone lines, and (ii) 
distribution of number of telephone lines in the population. (Hint: for (i), consider 
the parameterization (8.8).) 

(b) What assumptions did you make about households with no telephone lines and house- 
holds that did not respond to the ‘number of phone lines’ question? 

(c) Write the joint posterior distribution of all parameters in your model. 

(d) Draw 1000 simulations from the joint distribution. (Use approximate computational 
methods.) 

(e) Compute the mean preferences for Bush, Dukakis, and no opinion/other in the popu- 
lation of households (not phone numbers!) and display a histogram for the difference 
in support between Bush and Dukakis. Compare to Figure 3.2 and discuss any differ- 
ences. 


(f) Check the fit of your model to the data using posterior predictive checks. 

(g) Explore the sensitivity of your results to your assumptions. 


Number of                            Number of phone lines 
adults      Preference             1     2     3     4     ? 
1           Bush                 124     3     0     2     2 
            Dukakis              134     2     0     0     0 
            No opinion/other      32     0     0     0     1 
2           Bush                 332    21     3     0     5 
            Dukakis              229    15     0     0     3 
            No opinion/other      47     0     0     0     6 
3           Bush                  71     9     1     0     0 
            Dukakis               47     7     1     0     0 
            No opinion/other       4     1     0     0     0 
4           Bush                  23     4     0     1     0 
            Dukakis               11     3     0     0     0 
            No opinion/other       3     0     0     0     0 
5           Bush                   3     0     0     0     0 
            Dukakis                4     0     0     0     0 
            No opinion/other       1     0     0     0     0 
6           Bush                   1     0     0     0     0 
            Dukakis                1     0     0     0     0 
            No opinion/other       0     0     0     0     0 
7           Bush                   2     0     0     0     0 
            Dukakis                0     0     0     0     0 
            No opinion/other       0     0     0     0     0 
8           Bush                   1     0     0     0     0 
            Dukakis                0     0     0     0     0 
            No opinion/other       0     0     0     0     0 
?           Bush                   0     1     0     0     0 
            Dukakis                1     0     0     0     0 
            No opinion/other       0     0     0     0     0 

Table 8.8 Respondents to the CBS telephone survey classified by opinion, number of residential 
telephone lines (category ‘?’ indicates no response to the number of phone lines question), and 
number of adults in the household (category ‘?’ includes all responses greater than 8 as well as 
nonresponses). 


13. Sampling with unequal probabilities (continued): Table 8.8 summarizes the opinion poll 
discussed in the examples in Sections 3.4 and 8.3, with the responses classified by vote 
preference, size of household, and number of telephone lines in the household. Analyze 
these data assuming that the probability of reaching an individual is proportional to the 
number of telephone lines and inversely proportional to the number of persons in the 
household. Use this additional information to obtain inferences for the mean preferences 
for Bush, Dukakis, and no opinion/other among individuals, rather than households, 
answering the analogous versions of questions (a)-(g) in the previous exercise. Compare 
to your results for the previous exercise and explain the differences. (A complete analysis 
would require the data also cross-classified by the 16 strata in Table 8.2 as well as 
demographic data such as sex and age that affect the probability of nonresponse.) 


14. Rounded data: the last two columns of Table 2.2 on page 59 give data on passenger 
airline deaths and deaths per passenger mile flown. We would like to divide these to 
obtain the number of passenger miles flown in each year, but the ‘per mile’ data are 
rounded. (For the purposes of this exercise, ignore the column in the table labeled ‘Fatal 
accidents.’) 


(a) Using just the data from 1976 (734 deaths, 0.19 deaths per 100 million passenger 
miles), obtain inference for the number of passenger miles flown in 1976. Give a 95% 
posterior interval (you may do this by simulation). Clearly specify your model and 
your prior distribution. 

(b) Apply your method to obtain intervals for the number of passenger miles flown each 
year until 1985, analyzing the data from each year separately. 


(c) Now create a model that allows you to use data from all the years to estimate jointly 
the number of passenger miles flown each year. Estimate the model and give 95% 
intervals for each year. (Use approximate computational methods.) 


(d) Describe how you would use the results of this analysis to get a better answer for 
Exercise 2.13. 


15. Sequential treatment assignment: consider a medical study with two treatments, in which 
the subjects enter the study one at a time. As the subjects enter, they must be assigned 
treatments. Efron (1971) evaluates the following ‘biased-coin’ design for assigning treat- 
ments: each subject is assigned a treatment at random with probability of receiving 
treatment depending on the treatment assignments of the subjects who have previously 
arrived. If equal numbers of previous subjects have received each treatment, then the 
current subject is given the probability 1/2 of receiving each treatment; otherwise, he or 
she is given the probability p of receiving the treatment that has been assigned to fewer 
of the previous subjects, where p is a fixed value between 1/2 and 1. 


(a) What covariate must be recorded on the subjects for this design to be ignorable? 

(b) Outline how you would analyze data collected under this design. 

(c) To what aspects of your model is this design sensitive? 

(d) Discuss in Bayesian terms the advantages and disadvantages of the biased-coin design 
over the following alternatives: (i) independent randomization (that is, p = 1/2 in the 
above design), (ii) randomized blocks where the blocks consist of successive pairs of 
subjects (that is, p = 1 in the above design). Be aware of the practical complications 
discussed in Section 8.5. 
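
The assignment rule described in this exercise can be simulated directly, which may help in thinking about parts (a)-(d). The following is only a sketch of the mechanism as stated above, not an analysis of data collected under it; the function name, the arm labels "A" and "B", and the default p = 2/3 are illustrative assumptions.

```python
import random

def biased_coin_assignments(n_subjects, p=2/3, seed=0):
    """Simulate sequential treatment assignment under the biased-coin rule.

    If the two arms are balanced so far, the next subject gets either arm with
    probability 1/2; otherwise the arm assigned to fewer previous subjects is
    chosen with probability p (1/2 <= p <= 1).
    """
    rng = random.Random(seed)
    counts = {"A": 0, "B": 0}
    assignments = []
    for _ in range(n_subjects):
        if counts["A"] == counts["B"]:
            prob_a = 0.5
        elif counts["A"] < counts["B"]:
            prob_a = p            # A is behind, so favor A
        else:
            prob_a = 1.0 - p      # B is behind, so favor B
        arm = "A" if rng.random() < prob_a else "B"
        counts[arm] += 1
        assignments.append(arm)
    return assignments

print("".join(biased_coin_assignments(20, p=2/3)))
print("".join(biased_coin_assignments(20, p=1.0)))   # balanced after every pair
print("".join(biased_coin_assignments(20, p=0.5)))   # independent randomization
```

Setting p = 1/2 or p = 1 reproduces the two alternative designs named in part (d), so the same function can be used to compare how quickly imbalance builds up under each.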


16. Randomized experiment with noncompliance: Table 8.5 on page 223 gives data from the 
study of vitamin A in Indonesia described in the example on page 223 (see Imbens and 
Rubin, 1997). 

(a) Is treatment assignment ignorable? Strongly ignorable? Known? 

(b) Estimate the intention-to-treat effect: that is, the effect of assigning the treatment, 
irrespective of compliance, on the entire population. 

(c) Give the simple instrumental variables estimate of the average effect of the treatment 
for the compliers. 


(d) Write the likelihood, assuming compliance status is known for all units. 


17. Data structure and data analysis: 

An experiment is performed comparing two treatments applied to the growth of cell cul- 
tures. The cultures are in dishes, with six cultures to a dish. The researcher applies each 
treatment to five dishes in a simple unpaired design and then considers two analyses: (i) 
n = 30 for each treatment, assuming there is no dependence among the outcomes within 
a dish, so that the observations for each treatment can be considered as 30 independent 
data points; or (ii) n = 5 for each treatment, allowing for the possibility of dependence 
by using, for each dish, the mean of the six outcomes within the dish and then modeling 
the 5 dishes as independent. 

In either case, assume the researcher is doing the simple classical estimate. Thus, the 
estimated treatment effect is the same under either analysis—it is the average of the mea- 
surements under one treatment minus the average under the other. The only difference 
is whether the measurements are considered as clustered. 




The researcher suspects that the outcomes within each dish are independent: the cell 
cultures are far enough apart that they are not physically interacting, and the experiment 
is done carefully enough that there is no reason to suspect there are ‘dish effects.’ That 
said, the data are clustered, so dish effects are a possibility. Further suppose that it is 
completely reasonable to consider the different dishes as independent. 

The researcher reasons as follows: the advantage of method (i) is that, with 30 obser- 
vations instead of just 5 per treatment, the standard errors of the estimates should be 
much smaller. On the other hand, method (ii) seems like the safer approach as it should 
work even if there are within-dish correlations or dish effects. 


(a) Which of these two analyses should the researcher do? Explain your answer in two or 
three sentences. 
(b) Write a model that you might use for a Bayesian analysis of this problem. 
Chapter 9 


Decision analysis 


What happens after the data analysis, after the model has been built and the inferences 
computed, and after the model has been checked and expanded as necessary so that its 
predictions are consistent with observed data? What use is made of the inferences once the 
data analysis is done? 

One form of answer to this question came in Chapter 8, which discussed the necessity of 
extending the data model to encompass the larger population of interest, including missing 
observations, unobserved outcomes of alternative treatments, and units not included in the 
study. For a Bayesian model to generalize, it must account for potential differences between 
observed data and the population. 

This chapter considers a slightly different aspect of the problem: how can inferences be 
used in decision making? In a general sense, we expect to be using predictive distributions, 
but the details depend on the particular problem. In Section 9.1 we outline the theory of 
decision making under uncertainty, and the rest of the chapter presents examples of the 
application of Bayesian inference to decisions in social science, medicine, and public health. 

The first example, in Section 9.2, is the simplest: we fit a hierarchical regression on the 
effects of incentives on survey response rates, and then we use the predictions to estimate 
costs. The result is a graph estimating expected increase in response rate vs. the additional 
cost required, which allows us to apply general inferences from the regression model to mak- 
ing decisions for a particular survey of interest. From a decision-making point of view, this 
example is interesting because regression coefficients that are not ‘statistically significant’ 
(that is, whose posterior distributions leave non-negligible probability on both positive and 
negative values) are still highly 
relevant for the decision problem, and we cannot simply set them to zero. 

Section 9.3 presents a more complicated decision problem, on the option of performing 
a diagnostic test before deciding on a treatment for cancer. This is a classic problem of the 
‘value of information,’ balancing the risks of the screening test against the information that 
might lead to a better treatment decision. The example presented here is typical of the 
medical decision-making literature in applying a relatively sophisticated Bayesian decision 
analysis using point estimates of probabilities and risks taken from simple summaries of 
published studies. 

The example in Section 9.4 combines Bayesian hierarchical modeling, probabilistic de- 
cision analysis, and utility analysis, balancing the risks of exposure to radon gas against 
the costs of measuring the level of radon in a house and potentially remediating it. We see 
this analysis as a prototype of full integration of inference with decision analysis, beyond 
what is practical or feasible for most applications but indicating the connections between 
Bayesian hierarchical regression modeling and individually focused decision making. 


9.1 Bayesian decision theory in different contexts 


Many if not most statistical analyses are performed for the ultimate goal of decision making. 
In most of this book we have left the decision-making step implicit: we perform a Bayesian 
analysis, from which we can summarize posterior inference for quantities of interest such as 
the probability of dying of cancer, or the effectiveness of a medical treatment, or the vote 
by state in the next Presidential election. 

When explicitly balancing the costs and benefits of decision options under uncertainty, 
we use Bayesian inference in two ways. First, a decision will typically depend on predictive 
quantities (for example, the probability of recovery under a given medical treatment, or the 
expected value of a continuous outcome such as cost or efficacy under some specified inter- 
vention) which in turn depend on unknown parameters such as regression coefficients and 
population frequencies. We use posterior inferences to summarize our uncertainties about 
these parameters, and hence about the predictions that enter into the decision calculations. 
We give examples in Sections 9.2 and 9.4, in both cases using inferences from hierarchical 
regressions. 

The second way we use Bayesian inference is within a decision analysis, to determine the 
conditional distribution of relevant parameters and outcomes, given information observed 
as a result of an earlier decision. This sort of calculation arises in multistage decision trees, 
in particular when evaluating the expected value of information. We illustrate with a simple 
case in Section 9.3 and a more elaborate example in Section 9.4. 


Bayesian inference and decision trees 


Decision analysis is inherently more complicated than statistical inference because it involves 
optimization over decisions as well as averaging over uncertainties. We briefly lay out the 
elements of Bayesian decision analysis here. The implications of these general principles 
should become clear in the examples that follow. 

Bayesian decision analysis is defined mathematically by the following steps: 

1. Enumerate the space of all possible decisions d and outcomes x. In a business context, 
x might be dollars; in a medical context, lives or life-years. More generally, outcomes 
can have multiple attributes and would be expressed as vectors. Section 9.2 presents an 
example in which outcomes are in dollars and survey response rates, and in the example of 
Section 9.4, outcomes are summarized as dollars and lives. The vector of outcomes x can 
include observables (that is, predicted values y in our usual notation) as well as unknown 
parameters θ. For example, x could include the total future cost of an intervention (which 
would ultimately be observed) as well as its effectiveness in the population (which might 
never be measured). 


2. Determine the probability distribution of x for each decision option d. In Bayesian terms, 
this is the conditional posterior distribution, p(x|d). In the decision-analytic framework, 
the decision d does not have a probability distribution, and so we cannot speak of p(d) 
or p(x); all probabilities must be conditional on d. 


3. Define a utility function U(x) mapping outcomes onto the real numbers. In simple 
problems, utility might be identified with a single continuous outcome of interest x, such 
as years of life, or net profit. If the outcome x has multiple attributes, the utility function 
must trade off different goods, for example quality-adjusted life-years (in Section 9.3). 


4. Compute the expected utility E(U(x)|d) as a function of the decision d, and choose the 
decision with highest expected utility. In a decision tree—in which a sequence of two or 
more decisions might be taken—the expected utility must be calculated at each decision 
point, conditional on all information available up to that point. 

A full decision analysis includes all four of these steps, but in many applications, we 
simply perform the first two, leaving it to decision makers to balance the expected gains of 
different decision options. 
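
As a rough illustration of the four steps above, the following sketch works through a toy problem with discrete outcomes. The decision names, outcome values, and probabilities are hypothetical and chosen only for illustration; utility is assumed to be the dollar outcome itself.

```python
# Minimal sketch of the four steps of a Bayesian decision analysis.
# All names and numbers below are hypothetical, not taken from the text.

# Steps 1 and 2: decisions d, outcomes x (in dollars), and p(x | d).
outcome_probs = {
    "buy_warranty": {-100.0: 1.0},
    "no_warranty": {0.0: 0.95, -1500.0: 0.05},
}

# Step 3: utility function U(x); here utility is simply dollars.
def utility(x):
    return x

# Step 4: expected utility E(U(x) | d) for each decision, then maximize.
def expected_utility(probs, utility):
    return sum(p * utility(x) for x, p in probs.items())

expected = {d: expected_utility(probs, utility) for d, probs in outcome_probs.items()}
best = max(expected, key=expected.get)

print(expected)
print("best decision:", best)
```

With a different utility function (for example, one that penalizes large losses more than proportionally), the ranking of the two options could change, which is why the utility is specified as a separate step.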

This chapter includes three case studies of the use of Bayesian inference for decision 
analysis. In Section 9.2, we present an example in which decision making is carried halfway: 
we consider various decision options and evaluate their expected consequences, but we do 
not create a combined utility function or attempt to choose an optimal decision. Section 
9.3 analyzes a more complicated decision problem that involves conditional probability and 
the value of information. We conclude in Section 9.4 with a full decision analysis including 
utility maximization. 


Summarizing inference and model selection 


When we cannot present the entire posterior distribution (for example, because of 
decision making, market mechanisms, reporting requirements, or communication), we must 
decide how to summarize the inference. We discussed generic summaries in 
Chapter 2, but it is also possible to formulate the choice of summary as a decision problem. 
There are various utilities, called scoring functions for point predictions and scoring rules 
for probabilistic predictions, which were briefly discussed in Chapter 7. The usual point 
and interval summaries all have corresponding scoring functions or rules. The optimal summary 
under the logarithmic score used for model assessment in Chapter 7 is to report the 
entire posterior (predictive) distribution. 

Model selection can also be formulated as a decision problem: is the predictive performance 
of an expanded model significantly better? Often this decision is made informally, 
but a more formal decision can be made, for example, if there are data collection costs and 
the cost of measuring less relevant explanatory variables may outweigh the benefits of getting 
better predictions. 


9.2 Using regression predictions: survey incentives 


Our first example shows the use of a meta-analysis—fit from historical data using hierar- 
chical linear regression—to estimate predicted costs and benefits for a new situation. The 
decision analysis for this problem is implicit, but the decision-making framework makes it 
clear why it can be important to include predictors in a regression model even when they 
are not statistically significant. 


Background on survey incentives 


Common sense and evidence (in the form of randomized experiments within surveys) both 
suggest that giving incentives to survey participants tends to increase response rates. From 
a survey designer’s point of view, the relevant questions are: 


• Do the benefits of incentives outweigh the costs? 


• If an incentive is given, how and when should it be offered, whom should it be offered 
to, what form should it take, and how large should its value be? 


We consider these questions in the context of the New York City Social Indicators Survey, 
a telephone study conducted every two years that has had a response rate below 50%. Our 
decision analysis proceeds in two steps: first, we perform a meta-analysis to estimate the 
effects of incentives on response rate, as a function of the amount of the incentive and the 
way it is implemented. Second, we use this inference to estimate the costs and benefits of 
incentives in our particular survey. 

We consider the following factors that can affect the efficacy of an incentive: 


• The value of the incentive (in tens of 1999 dollars), 
• The timing of the incentive payment (given before the survey or after), 
• The form of the incentive (cash or gift), 
• The mode of the survey (face-to-face or telephone), 
• The burden, or effort, required of the survey respondents (a survey is characterized as 
high burden if it is over one hour long and has sensitive or difficult questions, and low 
burden otherwise). 

Figure 9.1 Observed increase z_i in response rate vs. the increased dollar value of incentive compared 
to the control condition, for experimental data from 39 surveys. Prepaid and postpaid incentives are 
indicated by closed and open circles, respectively. (The graphs show more than 39 points because 
many surveys had multiple treatment conditions.) The lines show expected increases for prepaid 
(solid lines) and postpaid (dashed lines) cash incentives as estimated from a hierarchical regression. 


Data from 39 experiments 


Data were collected on 39 surveys that had embedded experiments testing different incen- 
tive conditions. For example, a survey could, for each person contacted, give a $5 prepaid 
incentive with probability 1/3, a $10 prepaid incentive with probability 1/3, or no incentive 
with probability 1/3. The surveys in the meta-analysis were conducted on different popula- 
tions and at different times, and between them they covered a range of different interactions 
of the five factors mentioned above (value, timing, form, mode, and burden). In total, the 
39 surveys include 101 experimental conditions. We use the notation y_i to indicate the 
observed response rate for observation i = 1,...,101. 

A reasonable starting point uses the differences, z_i = y_i − y_i^0, where y_i^0 corresponds 
to the lowest-valued incentive condition in the survey that includes condition i (in most 
surveys, this is simply the control case of no incentive). Working with z_i reduces the 
number of cases in the analysis from 101 conditions to 62 differences and eliminates the 
between-survey variation in baseline response rates. 
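
The construction of the differences can be sketched as follows. The survey identifiers, incentive values, and response rates are hypothetical; only the rule (subtract the response rate of the lowest-valued incentive condition in the same survey) comes from the text.

```python
from collections import defaultdict

# Hypothetical records: (survey id, incentive value in dollars, response rate y_i).
conditions = [
    ("A", 0, 0.50), ("A", 5, 0.54), ("A", 10, 0.57),
    ("B", 2, 0.40), ("B", 10, 0.43),
]

def response_rate_differences(conditions):
    """Return (survey, added incentive value, z_i) for each non-baseline condition."""
    by_survey = defaultdict(list)
    for survey, value, rate in conditions:
        by_survey[survey].append((value, rate))

    differences = []
    for survey, rows in by_survey.items():
        # Baseline y_i^0: the lowest-valued incentive condition in this survey.
        base_value, base_rate = min(rows, key=lambda row: row[0])
        for value, rate in rows:
            if value > base_value:  # skip the baseline condition itself
                differences.append((survey, value - base_value, rate - base_rate))
    return differences

z = response_rate_differences(conditions)
print(len(conditions), "conditions ->", len(z), "differences")
```

Each survey contributes one fewer difference than it has conditions, which is consistent with the reduction from 101 conditions to 62 differences across 39 surveys.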

Figure 9.1 displays the difference in response rates z_i vs. the difference in incentive 
amounts, for each of the 62 differences i. The points are partitioned into subgraphs cor- 
responding to the mode and burden of their surveys. Within each graph, solid and open 
circles indicate prepaid and postpaid incentives, respectively. We complete the graphs by 
including a dotted line at zero, to represent the comparison case of no incentive. (The 
graphs also include fitted regression lines from the hierarchical model described below.) 
It is clear from the graphs in Figure 9.1 that incentives generally have positive effects, and 
that prepaid incentives tend to be smaller in dollar value. Some of the observed differences 
are negative, but this can be expected from sampling variability, given that some of the 39 
surveys are fairly small. 

A natural way to use these data to support the planning of future surveys is to fit a 
classical regression model relating z_i to the value, timing, and form of incentive as well 
as the mode and burden of survey. However, there are a number of difficulties with this 
approach. From the sparse data, it is difficult to estimate interactions which might well 
be important. For example, it seems reasonable to expect that the effect of prepaid versus 
postpaid incentive may depend on the amount of the incentive. In addition, a traditional 
regression would not reflect the hierarchical structure of the data: the 62 differences are 
clustered in 39 surveys. It is also not so easy in a regression model to account for the 
unequal sample sizes for the experimental conditions, which range from below 100 to above 
2000. A simple weighting proportional to sample size is not appropriate since the regression 
residuals include model error as well as binomial sampling error. 

We shall set up a slightly more elaborate hierarchical model because, for the purpose of 
estimating the costs and benefits in a particular survey, we need to estimate interactions 
in the model (for example, the interaction between timing and value of incentive), even if 
these are not statistically significant. 


Setting up a Bayesian meta-analysis 


We set up a hierarchical model with 101 data points i, nested within 39 surveys j. We start 
with a binomial model relating the number of respondents, n_i, to the number of persons 
contacted, N_i (thus, y_i = n_i/N_i), and the population response probabilities π_i: 


n_i ~ Bin(N_i, π_i). (9.1) 


The next stage is to model the probabilities π_i in terms of predictors X, including an 
indicator for survey incentives, the five incentive factors listed above, and various inter- 
actions. In general it would be advisable to use a transformation before modeling these 
probabilities since they are constrained to lie between 0 and 1. However, in our particular 
application area, response probabilities in telephone and face-to-face surveys are far enough 
from 0 and 1 that a linear model is acceptable: 


π_i ~ N(X_i β + α_j(i), σ²). (9.2) 


Here, X_i β is the linear predictor for the condition corresponding to data point i, α_j(i) is 
a random effect for the survey j = 1,...,39 (necessary in the model because underlying 
response rates vary greatly), and σ represents the lack of fit of the linear model. We 
use the notation j(i) because the conditions i are nested within surveys j. The use of the 
survey random effects allows us to incorporate the 101 conditions in the analysis rather than 
working with the 62 differences as was done earlier. The α_j’s also address the hierarchical 
structure of the data. 

Modeling (9.2) on the untransformed scale is not simply an approximation but rather a 
choice to set up a more interpretable model. Switching to the logistic, for example, would 
have no practical effect on our conclusions, but it would make all the regression coefficients 
much more difficult to interpret. 

We next specify prior distributions for the parameters in the model. We model the 
survey-level random effects α_j using a normal distribution: 


α_j ~ N(0, τ²). (9.3) 


There is no loss of generality in assuming a zero mean for the α_j’s if a constant term 
is included in the set of predictors X. Finally, we assign uniform prior densities to the 
standard deviations σ and τ and to the regression coefficients β. The parameters σ and τ 
are estimated precisely enough that the inferences are not sensitive to the particular choice 
of noninformative prior distribution. 
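
To make the structure of (9.1)-(9.3) concrete, the sketch below simulates data forward from the model for a small, hypothetical design. The design matrix, sample sizes, and coefficient values are assumptions made for illustration, and the clipping of π_i to [0, 1] is a guard added here, not part of the model. This is a forward simulation only; it does not fit the model or compute a posterior distribution.

```python
import random

def simulate_conditions(X, survey_of, N, beta, sigma, tau, seed=1):
    """Forward-simulate respondent counts n_i from the model (9.1)-(9.3)."""
    rng = random.Random(seed)
    n_surveys = max(survey_of) + 1
    # (9.3): survey-level random effects alpha_j ~ N(0, tau^2).
    alpha = [rng.gauss(0.0, tau) for _ in range(n_surveys)]
    counts = []
    for x_i, j, N_i in zip(X, survey_of, N):
        linear = sum(b * x for b, x in zip(beta, x_i))  # X_i beta
        # (9.2): pi_i ~ N(X_i beta + alpha_j(i), sigma^2).
        pi_i = rng.gauss(linear + alpha[j], sigma)
        pi_i = min(max(pi_i, 0.0), 1.0)  # guard added here; not part of the model
        # (9.1): n_i ~ Bin(N_i, pi_i), drawn as a sum of N_i Bernoulli trials.
        n_i = sum(rng.random() < pi_i for _ in range(N_i))
        counts.append(n_i)
    return alpha, counts

# Hypothetical design: columns are constant, Incentive, Incentive x Value
# (value in tens of dollars).
X = [[1, 0, 0.0], [1, 1, 0.5], [1, 1, 1.0], [1, 0, 0.0], [1, 1, 2.0]]
survey_of = [0, 0, 0, 1, 1]        # j(i): survey index for each condition i
N = [400, 400, 400, 1000, 1000]    # persons contacted, N_i
beta = [0.5, 0.02, 0.03]           # illustrative coefficients only

alpha, n = simulate_conditions(X, survey_of, N, beta, sigma=0.035, tau=0.18)
print([round(n_i / N_i, 3) for n_i, N_i in zip(n, N)])
```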


Inferences from the model 


Thus far we have not addressed the choice of variables to include in the matrix of predictors, 
X. The main factors are those described earlier (which we denote as Value, Timing, Form, 
Mode, and Burden) along with Incentive, an indicator for whether a given condition includes 
an incentive (not required when we were working with the differences). Because there are 
restrictions (Value, Timing, and Form are only defined if Incentive = 1), there are 36 possible 
regression predictors, including the constant term and working up to the interaction of all 
five factors with incentive. The number of predictors would increase if we allowed for 
nonlinear functions of incentive value. 


Of the predictors, we are particularly interested in those that include interactions with 
the Incentive indicator, since these indicate effects of the various factors. The two-way 
interactions in the model that include Incentive can thus be viewed as main effects of the 
factors included in the interactions, the three-way interactions can be viewed as two-way 
interactions of the included factors, and so forth. 


We fit a series of models, starting with the simplest, then adding interactions until we 
pass the point where the existing data could estimate them effectively, then finally choosing a 
model that includes the key interactions needed for our decision analysis. Our chosen model 
includes the main effects for Mode, Burden, and the Mode × Burden interaction, which all 
have the anticipated large impacts on the response rate of a survey. It also includes Incentive 
(on average, the use of an incentive increases the response rate by around 3 percentage 
points), all two-way interactions of Incentive with the other factors, and the three-way 
interactions that include Incentive × Value interacting with Timing and Burden. We do 
not provide detailed results here, but some of the findings are that an extra $10 in incentive is 
expected to increase the response rate by 3–4 percentage points, cash incentives increase the 
response rate by about 1 percentage point relative to noncash, prepaid incentives increase 
the response rate by 1–2 percentage points relative to postpaid, and incentives have a bigger 
impact (by about 5 percentage points) on high-burden surveys compared to low-burden 
surveys. 


The within-study standard deviation σ is around 3 or 4 percentage points, indicating 
the accuracy with which differential response rates can be predicted within any survey. 
The between-study standard deviation τ is about 18 percentage points, indicating that the 
overall response rates vary greatly, even after accounting for the survey-level predictors 
(Mode, Burden, and their interaction). 


Figure 9.1 displays the model fit as four graphs corresponding to the two 
possible values of the Burden and Mode variables. Within each graph, we display solid lines 
for the prepaid condition and dotted lines for postpaid incentives, in both cases showing 
only the results with cash incentives, since these were estimated to be better than gifts of 
the same value. 


To check the fit, we display in Figure 9.2 residual plots of prediction errors for the 
individual data points y_i, showing telephone and face-to-face surveys separately and, as 
with the previous plots, using symbols to distinguish pre- and post-incentives. There are 
no apparent problems with the basic fit of the model, although other models could also fit 
these data equally well. 
Figure 9.2 Residuals of response rate meta-analysis data plotted vs. predicted values. Residuals for 
telephone and face-to-face surveys are shown separately. As in Figure 9.1, solid and open circles 
indicate surveys with prepaid and postpaid incentives, respectively. 


Inferences about costs and response rates for the Social Indicators Survey 


The Social Indicators Survey is a low-burden telephone survey. If we use incentives at all, 
we would use cash, since this appears to be more effective than gifts of the same value. We 
then have the choice of value and timing of incentives. Regarding timing, prepaid incentives 
are more effective than postpaid incentives per dollar of incentive (compare the slopes of 
the solid and dashed lines in Figure 9.1). But this does not directly address our decision 
problem. Are prepaid incentives still more effective than postpaid incentives when we look 
at total dollars spent? This is not immediately clear, since prepaid incentives must be sent 
to all potential respondents, whereas postpaid are given only to the people who actually 
respond. It can be expensive to send the prepaid incentives to the potential respondents 
who cannot be reached, refuse to respond, or are eliminated in the screening process. 

We next describe how the inferences are used to inform decisions in the context of the 
Social Indicators Survey. This survey was conducted by random digit dialing in two parts: 
750 respondents came from an ‘individual survey,’ in which an attempt was made to survey 
an adult from every residential phone number that is called, and 1500 respondents came 
from a ‘caregiver survey,’ which included only adults who are taking care of children. The 
caregiver survey began with a screening question to eliminate households without children. 

For each of the two surveys, we use our model to estimate the expected increase in 
response rate for any hypothesized incentive and the net increase in cost to obtain that 
increase in response rate. It is straightforward to use the fitted hierarchical regression 
model to estimate the expected increase in response rate. Then we work backward and 
estimate the number of telephone calls required to reach the same number of respondents 
with this higher response rate. The net cost of the hypothesized incentive is the dollar 
value of the incentive (plus $1.25 to account for the cost of processing and mailing) times 
the number of people to whom the incentive is sent less the savings that result because 
fewer contacts are required. 

Figure 9.3 Expected increase in response rate vs. net added cost per respondent, for prepaid (solid 
lines) and postpaid (dotted lines) incentives, for surveys of individuals and caregivers. On each plot, 
heavy lines correspond to the estimated effects, with light lines showing ±1 standard error bounds. 
The numbers on the lines indicate incentive payments. At zero incentive payments, estimated effects 
and costs are nonzero because the models have nonzero intercepts (corresponding to the effect of 
making any contact at all) and we are assuming a $1.25 mailing and processing cost per incentive. 

For example, consider a $5 postpaid incentive for the caregiver survey. From the fitted 
model, this would lead to an expected increase of 1.5% in response rate, which would 
increase it from the 38.9% in the actual survey to a hypothesized 40.4%. The cost of the 
postpaid incentives for 1500 respondents at $6.25 each ($5 incentive plus $1.25 processing 
and mailing cost) is $9375. With the number of responses fixed, the increased response rate 
implies that only 1500/0.404 = 3715 eligible households would have to be reached, instead 
of the 3856 households contacted in the actual survey. Propagating back to the screening 
stage leads to an estimated number of telephone numbers that would need to be contacted 
and an estimated number of calls to reach those numbers. In this case we estimate that 3377 
fewer calls would be required, yielding an estimated savings of $2634 (based on the cost of 
interviewers and the average length per non-interview call). The net cost of the incentive is 
then $9375 − $2634 = $6741, which when divided by the 1500 completed interviews yields 
a cost of $4.49 per interview for this 1.5% increase in response rate. We perform similar 
calculations for other hypothesized incentive conditions. 
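
The arithmetic for the $5 postpaid incentive can be retraced with a short script. All inputs are the figures quoted in the text; in particular, the $2634 savings in calling costs is taken as given, since the screening-stage calculation behind it is not spelled out here. The variable names are our own.

```python
# Net cost per interview for a $5 postpaid incentive in the caregiver survey.
respondents = 1500        # completed interviews (held fixed)
incentive = 5.00          # dollars paid to each respondent
processing = 1.25         # processing and mailing cost per incentive
baseline_rate = 0.389     # response rate in the actual survey
rate_increase = 0.015     # expected increase from the fitted model
call_savings = 2634.0     # dollars saved on calls, as quoted in the text

new_rate = baseline_rate + rate_increase
incentive_cost = respondents * (incentive + processing)

# Eligible households that must be reached to obtain the same number of
# respondents. With the rounded rate of 0.404 this gives about 3713; the
# text reports 3715, presumably computed from an unrounded rate.
households_needed = respondents / new_rate

net_cost = incentive_cost - call_savings
cost_per_interview = net_cost / respondents

print(f"incentive cost:     ${incentive_cost:.0f}")
print(f"net cost:           ${net_cost:.0f}")
print(f"cost per interview: ${cost_per_interview:.2f}")
```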

Figure 9.3 summarizes the results for a range of prepaid and postpaid incentive val- 
ues, assuming we would spend up to $20 per respondent in incentives. For either survey, 
incentives are expected to raise response rates by only a few percentage points. Prepaid 
incentives are expected to be slightly better for the individual survey, and postpaid are 
preferred for the (larger) caregiver survey. For logistical reasons, we would use the same 
form of incentive for both, so we recommend postpaid. In any case, we leave the final step 
of the decision analysis—picking the level of the incentive—to the operators of the survey, 
who must balance the desire to increase response rate with the cost of the incentive itself. 


Loose ends 


Our study of incentives is far from perfect; we use it primarily to demonstrate how a 
relatively routine analysis can be used to make inferences about the potential consequences 
of decision options. The most notable weaknesses are the high level of uncertainty about 
individual coefficients (not shown here) and the arbitrariness of the decision as to which 
interactions should be included/excluded. These two problems go together: when we tried 
including more interactions, the standard errors became even larger and the inferences 
became less believable. The problem is with the noninformative uniform prior distribution 
on the coefficients. It would make more sense to include all interactions and make use of 
prior information that might shrink the higher-order interactions without fixing them at 
zero. It would also be reasonable to allow the effects of incentives to vary among surveys. 
We did not expand the model in these ways because we felt we were at the limit of our 
knowledge about this problem, and we thought it better to stop and summarize our inference 
and uncertainties about the costs and benefits of incentives. 

Another weakness of the model is its linearity, which implies undiminishing effects as 
incentives rise. It would be possible to add an asymptote to the model to fix this, but we 
do not do so, since in practice we would not attempt to extrapolate our inferences beyond 
the range of the data in the meta-analysis (prepaid incentives up to $20 and postpaid up 
to $60 or $100; see Figure 9.1). 
9.3 Multistage decision making: medical screening 


Decision analysis becomes more complicated when there are two or more decision points, 
with later decisions depending on data gathered after the first decision has been made. Such 
decision problems can be expressed as trees, alternating between decision and uncertainty 
nodes. In these multistage problems, Bayesian inference is particularly useful in updating 
the state of knowledge with the information gained at each step. 


Example with a single decision point 


We illustrate with a simplified example from the medical decision making literature. A 
95-year-old man with an apparently malignant tumor in the lung must decide between 
the three options of radiotherapy, surgery, or no treatment. The following assumptions 
are made about his condition and life expectancy (in practice, these probabilities and life 
expectancies are based on extrapolations from the medical literature): 
• There is a 90% chance that the tumor is malignant. 
• If the man does not have lung cancer, his life expectancy is 34.8 months. 
• If the man does have lung cancer, 

1. With radiotherapy, his life expectancy is 16.7 months. 
2. With surgery, there is a 35% chance he will die immediately, but if he survives, his 
life expectancy is 20.3 months. 
3. With no treatment, his life expectancy is 5.6 months. 
Aside from mortality risk, the treatments themselves cause considerable discomfort for 
slightly more than a month. We shall determine the decision that maximizes the patient’s 
quality-adjusted life expectancy, which is defined as the expected length of time the patient 
survives, minus a month if he goes through one of the treatments. The subtraction of a 
month addresses the loss in ‘quality of life’ due to treatment-caused discomfort. 

Quality-adjusted life expectancy under each treatment is then 

1. With radiotherapy: 0.9 · 16.7 + 0.1 · 34.8 − 1 = 17.5 months. 
2. With surgery: 0.35 · 0 + 0.65 · (0.9 · 20.3 + 0.1 · 34.8 − 1) = 13.5 months. 
3. With no treatment: 0.9 · 5.6 + 0.1 · 34.8 = 8.5 months. 


These simple calculations show radiotherapy to be the preferred treatment for this 95-year- 
old man. 
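
The three expected values can be checked with a few lines of code. The function and constant names are ours; the probabilities and life expectancies are those listed above.

```python
# Quality-adjusted life expectancy (in months) with a single decision point.
P_CANCER = 0.9
LIFE_NO_CANCER = 34.8
LIFE_CANCER = {"radiotherapy": 16.7, "surgery": 20.3, "no treatment": 5.6}
P_SURGERY_DEATH = 0.35
TREATMENT_PENALTY = 1.0   # one month subtracted for treatment discomfort

def qale(treatment, p_cancer):
    """Expected quality-adjusted months of life under a treatment."""
    months = p_cancer * LIFE_CANCER[treatment] + (1 - p_cancer) * LIFE_NO_CANCER
    if treatment == "no treatment":
        return months
    months -= TREATMENT_PENALTY
    if treatment == "surgery":
        # Immediate death contributes zero months with probability 0.35.
        months *= 1 - P_SURGERY_DEATH
    return months

results = {t: qale(t, P_CANCER) for t in LIFE_CANCER}
best = max(results, key=results.get)

for treatment, months in results.items():
    print(f"{treatment}: {months:.1f} months")
print("preferred:", best)
```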


Adding a second decision point 


The problem becomes more complicated when we consider a fourth decision option, which 
is to perform a test to see if the cancer is truly malignant. The test, called bronchoscopy, 
is estimated to have a 70% chance of detecting the lung cancer if the tumor is indeed 
malignant, and a 2% chance of falsely finding cancer if the tumor is actually benign. In 
addition, there is an estimated 5% chance that complications from the test itself will kill 
the patient. 

Should the patient choose bronchoscopy? To make this decision, we must first determine 
what he would do after the test. Bayesian inference with discrete probabilities gives the 
probability of cancer given the test result T as 
Pr(cancer|T) = Pr(cancer) p(T|cancer) / [Pr(cancer) p(T|cancer) + Pr(no cancer) p(T|no cancer)], 
and we can use this conditional probability in place of the prior probability Pr(cancer) = 0.9 
in the single-decision-point calculations above. 
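
As a quick check of this formula, the sketch below evaluates the updated probability of cancer for each possible test result, using the test characteristics stated above. The function name is our own.

```python
def posterior_cancer(prior, p_result_given_cancer, p_result_given_no_cancer):
    """Pr(cancer | T) by Bayes' rule for a discrete test result T."""
    numerator = prior * p_result_given_cancer
    return numerator / (numerator + (1 - prior) * p_result_given_no_cancer)

PRIOR = 0.9            # Pr(cancer) before the test
SENSITIVITY = 0.70     # Pr(positive | cancer)
FALSE_POSITIVE = 0.02  # Pr(positive | no cancer)

p_if_positive = posterior_cancer(PRIOR, SENSITIVITY, FALSE_POSITIVE)
p_if_negative = posterior_cancer(PRIOR, 1 - SENSITIVITY, 1 - FALSE_POSITIVE)

print(f"Pr(cancer | positive) = {p_if_positive:.3f}")
print(f"Pr(cancer | negative) = {p_if_negative:.3f}")
```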
• If the test is positive for cancer, then the patient’s updated probability of cancer is 
(0.9 · 0.7)/(0.9 · 0.7 + 0.1 · 0.02) = 0.997, and his quality-adjusted life expectancy under each of the three 
treatments becomes 


1. With radiotherapy: 0.997 · 16.7 + 0.003 · 34.8 − 1 = 15.8 months. 
2. With surgery: 0.35 · 0 + 0.65 · (0.997 · 20.3 + 0.003 · 34.8 − 1) = 12.6 months. 
3. With no treatment: 0.997 · 5.6 + 0.003 · 34.8 = 5.7 months. 


So, if the test is positive, radiotherapy would be the best treatment, with a quality- 
adjusted life expectancy of 15.8 months. 


• If the test is negative for cancer, then the patient’s updated probability of cancer is 
(0.9 · 0.3)/(0.9 · 0.3 + 0.1 · 0.98) = 0.734, and his quality-adjusted life expectancy under each of the three 
treatments becomes 


1. With radiotherapy: 0.734 · 16.7 + 0.266 · 34.8 − 1 = 20.5 months. 
2. With surgery: 0.35 · 0 + 0.65 · (0.734 · 20.3 + 0.266 · 34.8 − 1) = 15.1 months. 
3. With no treatment: 0.734 · 5.6 + 0.266 · 34.8 = 13.4 months. 


If the test is negative, radiotherapy would still be the best treatment, this time with a 
quality-adjusted life expectancy of 20.5 months. 


At this point, it is clear that bronchoscopy is not a good idea, since whichever way the 
test goes, it will not affect the decision that is made. To complete the analysis, 
however, we work out the quality-adjusted life expectancy for this decision option. The 
bronchoscopy can yield two possible results: 


• Test is positive for cancer. The probability of this outcome is 0.9 · 0.7 + 0.1 · 0.02 = 0.632, 
and the quality-adjusted life expectancy (accounting for the 5% chance that the test can 
be fatal) is 0.95 · 15.8 = 15.0 months. 


• Test is negative for cancer. The probability of this outcome is 0.9 · 0.3 + 0.1 · 0.98 = 0.368, 
and the quality-adjusted life expectancy (accounting for the 5% chance that the test can 
be fatal) is 0.95 · 20.5 = 19.5 months. 


The total quality-adjusted life expectancy for the bronchoscopy decision is then 
0.632 · 15.0 + 0.368 · 19.5 = 16.6 months. Since radiotherapy without a bronchoscopy yields an 
expected quality-adjusted survival of 17.5 months, it is clear that the patient should not 
choose bronchoscopy. 

The decision analysis reveals the perhaps surprising result that, in this scenario, bron- 
choscopy is pointless, since it would not affect the decision that is to be made. Any other 
option (for example, bronchoscopy, followed by a decision to do radiotherapy if the test is 
positive or do no treatment if the test is negative) would be even worse in expected value. 
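
The full two-stage tree can be evaluated by working backward: first choose the best treatment for each test result, then average over test results and the risk of the test itself. The sketch below does this with the numbers given in this section; the function and variable names are our own. It carries unrounded intermediate values, which is how the total comes out at 16.6 months rather than the 16.7 that the rounded figures would suggest.

```python
# Two-stage decision tree: bronchoscopy first, then the best treatment.
PRIOR = 0.9             # Pr(cancer) before any test
SENSITIVITY = 0.70      # Pr(positive | cancer)
FALSE_POSITIVE = 0.02   # Pr(positive | no cancer)
P_TEST_DEATH = 0.05     # chance the bronchoscopy itself is fatal
LIFE_NO_CANCER = 34.8
LIFE_CANCER = {"radiotherapy": 16.7, "surgery": 20.3, "no treatment": 5.6}

def qale(treatment, p_cancer):
    """Expected quality-adjusted months of life under a treatment."""
    months = p_cancer * LIFE_CANCER[treatment] + (1 - p_cancer) * LIFE_NO_CANCER
    if treatment == "no treatment":
        return months
    months -= 1.0                      # one month lost to treatment discomfort
    if treatment == "surgery":
        months *= 0.65                 # 35% chance of immediate death
    return months

def best_treatment(p_cancer):
    """Return (months, treatment) for the treatment with highest QALE."""
    return max((qale(t, p_cancer), t) for t in LIFE_CANCER)

# Probabilities of each test result, and the updated probability of cancer.
p_positive = PRIOR * SENSITIVITY + (1 - PRIOR) * FALSE_POSITIVE
p_negative = 1 - p_positive
post_positive = PRIOR * SENSITIVITY / p_positive
post_negative = PRIOR * (1 - SENSITIVITY) / p_negative

months_pos, choice_pos = best_treatment(post_positive)
months_neg, choice_neg = best_treatment(post_negative)

# Expected value of testing, including the risk that the test is fatal.
qale_test = (1 - P_TEST_DEATH) * (p_positive * months_pos + p_negative * months_neg)
qale_no_test, choice_no_test = best_treatment(PRIOR)

print(f"test first: {qale_test:.1f} months ({choice_pos} / {choice_neg})")
print(f"no test:    {qale_no_test:.1f} months ({choice_no_test})")
```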


9.4 Hierarchical decision analysis for home radon 


Associated with many household environmental hazards is a decision problem: whether to 
(1) perform an expensive remediation to reduce the risk from the hazard, (2) do nothing, 
or (3) take a relatively inexpensive measurement to assess the risk and use this information 
to decide whether to (a) remediate or (b) do nothing. This decision can often be made at 
the individual, household, or community level. Performing this decision analysis requires 
estimates for the risks. Given the hierarchical nature of the decision-making units (individuals 
are grouped within households, which are grouped within counties, and so forth), it is 
natural to use a hierarchical model to estimate the risks. 

We illustrate with the example of risks and remediation for home radon exposure. We 
provide a fair amount of background detail to make sure that the context of the decision 
analysis is clear. 
Figure 9.4 Lifetime added risk of lung cancer, as a function of average radon exposure in picoCuries 
per liter (pCi/L). The median and mean radon levels in ground-contact houses in the U.S. are 0.67 
and 1.3 pCi/L, respectively, and over 50,000 homes have levels above 20 pCi/L. 


Background 


Radon is a carcinogen—a naturally occurring radioactive gas whose decay products are 
also radioactive—known to cause lung cancer in high concentration, and estimated to cause 
several thousand lung cancer deaths per year in the U.S. Figure 9.4 shows the estimated 
additional lifetime risk of lung cancer death for male and female smokers and nonsmokers, 
as a function of average radon exposure. At high levels, the risks are large, and even the 
risks at low exposures are not trivial when multiplied by the millions of people affected. 

The distribution of annual-average living area home radon concentrations in U.S. houses, 
as measured by a national survey (described in more detail below), is approximately log- 
normal with geometric mean 0.67 pCi/L and geometric standard deviation 3.1 (the median 
of this distribution is 0.67 pCi/L and the mean is 1.3 pCi/L). The vast majority of houses 
in the U.S. do not have high radon levels: about 84% have concentrations under 2 pCi/L, 
and about 90% are below 3 pCi/L. However, the survey data suggest that between 50,000 
and 100,000 homes have radon concentrations in primary living space in excess of 20 pCi/L. 
This level causes an annual radiation exposure roughly equal to the occupational exposure 
limit for uranium miners. 
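
The summary figures quoted for the lognormal approximation can be checked from the geometric mean and geometric standard deviation alone. The sketch below uses the standard lognormal formulas; the variable and function names are our own, and the results are approximate because the survey distribution is only roughly lognormal.

```python
import math

GM = 0.67    # geometric mean, pCi/L (also the median of a lognormal)
GSD = 3.1    # geometric standard deviation

mu = math.log(GM)       # mean of log concentration
s = math.log(GSD)       # standard deviation of log concentration

# Mean of a lognormal distribution: exp(mu + s^2 / 2).
mean_level = math.exp(mu + s ** 2 / 2)

def fraction_below(level):
    """Fraction of houses below `level` pCi/L under the lognormal model."""
    z = (math.log(level) - mu) / s
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(f"mean: {mean_level:.2f} pCi/L")                 # about 1.3
print(f"below 2 pCi/L: {fraction_below(2.0):.2f}")     # roughly 0.83
print(f"below 3 pCi/L: {fraction_below(3.0):.2f}")     # roughly 0.91
```

These values are close to the quoted mean of 1.3 pCi/L and to the stated fractions of about 84% and 90%.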

Our decision problem includes as one option measuring the radon concentration and 
using this information to help decide whether to take steps to reduce the risk from radon. 
The most frequently used measurement protocol in the U.S. has been the ‘screening’ mea- 
surement: a short-term (2-7 day) charcoal-canister measurement made on the lowest level 
of the home (often an unoccupied basement), at a cost of about $15 to $20. Because they 
are usually made on the lowest level of the home (where radon levels are highest), short- 
term measurements are upwardly biased measures of annual living area average radon level. 
The magnitude of this bias varies by season and by region of the country and depends on 
whether the basement (if any) is used as living space. After correcting for biases, short- 
term measurements in a house have approximate lognormal distributions with geometric 
standard deviation of roughly 1.8. 

A radon measure that is far less common than the screening measurement, but is much 
better for evaluating radon risk, is a 12-month integrated measurement of the radon con- 
centration. These long-term observations directly measure the annual living-area average 
radon concentration, with a geometric standard deviation of about 1.2, at a cost of about 
$50. In the discussion below we find that long-term measurements are more effective, in a 
cost-benefit sense, than short-term measurements. 

If the radon level in a home is sufficiently high, then an individual may take action 
to control the risk due to radon. Several radon control or remediation techniques have 
been developed, tested, and implemented. The currently preferred remediation method 
for most homes, ‘sub-slab depressurization,’ seals the floors and increases ventilation, at a 
cost of about $2000, including additional heating and cooling costs. Studies suggest that
almost all homes can be remediated to below 4 pCi/L, while reductions under 1 pCi/L are 
rarely attained with conventional methods. For simplicity, we make the assumption that 
remediation will reduce radon concentration to 2 pCi/L. For obvious reasons, little is known 
about effects of remediation on houses that already have low radon levels; we assume that if 
the initial annual living area average level is less than 2 pCi/L, then remediation will have 
no effect. 


The individual decision problem 


We consider the individual homeowner to have three options: 


1. Remediate without monitoring: spend the $2000 to remediate the home and reduce radon 
exposure to 2 pCi/L. 


2. Do nothing and accept the current radon exposure. 


3. Take a long-term measurement of your home at a cost of $50. Based on the result of the 
measurement, decide whether to remediate or do nothing. 


As described above, a short-term measurement is another possibility, but in our analysis 
we find this not to be cost-effective. 

The measurement/remediation decision must generally be made under uncertainty, because
most houses have not been measured for radon. Even after measurement, the radon
level is not known exactly—just as in the cancer treatment example in Section 9.3, the 
cancer status is not perfectly known even after the test. The decision analysis thus presents 
two challenges: first, deciding whether to remediate if the radon exposure were known; and 
second, deciding whether it is worth it to measure radon exposure given the current state of 
knowledge about home radon—that is, given the homeowner’s prior distribution. This prior 
distribution is not a subjective quantity; rather, we determine it by a hierarchical analysis 
of a national sample of radon measurements, as we discuss below. 


Decision-making under certainty 


Before performing the statistical analysis, we investigate the optimal decision for the home- 
owner with a known radon exposure. The problem is difficult because it trades off dollars 
and lives. 

We express decisions under certainty in terms of three quantities, equivalent under a 
linear no-threshold dose-response relationship: 


1. \(D_d\), the dollar value associated with a reduction of \(10^{-6}\) in probability of death from
lung cancer (essentially the value of a ‘microlife’);


2. \(D_r\), the dollar value associated with a reduction of 1 pCi/L in home radon level for a
30-year period;


3. \(R_{\text{action}}\), the home radon level above which you should remediate if your radon level is
known.


The dollar value of radon reduction, \(D_r\), depends on the number of lives (or microlives)
saved by a drop in the radon level. This in turn depends on a variety of factors including
the number, gender and smoking status of household occupants as well as the decrease in
cancer risk due to the decrease in radon exposure. We do not discuss the details of such
a calculation here but only report that for a ‘typical’ U.S. household (one with an average
number of male and female smokers and nonsmokers) \(D_r = 4800\,D_d\). The appropriate radon
level to act upon, \(R_{\text{action}}\), depends on the dollar value of radon reduction and the benefits of
remediation. We assume that remediation takes a house’s annual-average living-area radon
level down to a level \(R_{\text{remed}} = 2\) pCi/L if it was above that, but leaves it unchanged if it
was below that. Then the action level is determined as the value at which the benefit of
remediation, \(\$D_r(R_{\text{action}} - R_{\text{remed}})\), is equal to the cost ($2000),

\[ R_{\text{action}} = \frac{\$2000}{D_r} + R_{\text{remed}}. \tag{9.4} \]

The U.S., English, Swedish, and Canadian governments recommend remediation levels of
\(R_{\text{action}}\) = 4, 5, 10, and 20 pCi/L, which, with \(R_{\text{remed}} = 2\) pCi/L, correspond to equivalent
costs per pCi/L of \(D_r\) = $1000, $670, $250, and $111, respectively. For an average U.S.
household this implies dollar values per microlife of \(D_d\) = $0.21, $0.14, $0.05, and $0.02,
respectively.

From the risk assessment literature, typical values of \(D_d\) for medical interventions are
in the range $0.10 to $0.50. Higher values are often attached to life in other contexts (for 
example, jury awards for deaths due to negligence). The lower values seem reasonable in 
this case because radon remediation, like medical intervention, is voluntary and addresses 
reduction of future risk rather than compensation for current loss. 

With these as a comparison, the U.S. and English recommendations for radon action 
levels correspond to the low end of the range of acceptable risk-reduction expenditures. The 
Canadian and Swedish recommendations are relatively cavalier about the radon risk, in the 
sense that the implied dollar value per microlife is lower than ordinarily assumed for other 
risks. 

Our calculation (which assumes an average U.S. household) obscures dramatic differences
among individual households. For example, a household of one male nonsmoker and
one female nonsmoker that is willing to spend $0.21 per person to reduce the probability of
lung cancer by \(10^{-6}\) (so that \(D_d\) = $0.21) should spend $370 per pCi/L of radon reduction
because their risk of lung cancer is less than for the average U.S. household. As a result, a
suitable action level for such a household is \(R_{\text{action}} = 7.4\) pCi/L, which can be compared to
\(R_{\text{action}} = 4\) for the average household. In contrast, if the male and female are both smokers,
they should be willing to spend the much higher value of $1900 per pCi/L, because of their
higher risk of lung cancer, and thus should have an action level of \(R_{\text{action}} = 3.1\) pCi/L.
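
The conversions in this subsection, from an action level to a dollar value per pCi/L and per microlife and back again, are simple rearrangements of (9.4). The sketch below reproduces the quoted figures; the function and constant names are illustrative assumptions, and the factor of 4800 is the average-household value reported above.

```python
REMEDIATION_COST = 2000.0     # dollars (from the text)
R_REMED = 2.0                 # pCi/L after remediation (from the text)
MICROLIVES_PER_PCI = 4800.0   # D_r / D_d for an average U.S. household (from the text)

def dollars_per_pci(r_action):
    """D_r implied by an action level, by rearranging (9.4)."""
    return REMEDIATION_COST / (r_action - R_REMED)

def action_level(d_r):
    """Equation (9.4): R_action = $2000 / D_r + R_remed."""
    return REMEDIATION_COST / d_r + R_REMED

# Government action levels quoted in the text, in the order listed there.
for country, r_action in [("U.S.", 4), ("England", 5), ("Sweden", 10), ("Canada", 20)]:
    d_r = dollars_per_pci(r_action)
    d_d = d_r / MICROLIVES_PER_PCI
    print(f"{country:8s} R_action={r_action:2d}  D_r=${d_r:7.0f}  D_d=${d_d:.2f}")

# Household-specific values of D_r quoted in the text.
for label, d_r in [("two nonsmokers", 370.0), ("two smokers", 1900.0)]:
    print(f"{label}: R_action = {action_level(d_r):.1f} pCi/L")
```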

Other sources of variation in \(R_{\text{action}}\) across households, in addition to household
composition, are (a) variation in risk preferences, (b) variation in individual beliefs about the
risks of radon and the effects of remediation, and (c) variation in the perceived dollar value
associated with a given risk reduction. Through the rest of our analysis we use \(R_{\text{action}} = 4\)
pCi/L as an exemplary value, but rational informed individuals might plausibly choose
different values of \(R_{\text{action}}\), depending on financial resources, general risk tolerance, attitude
towards radon risk, as well as the number of people in the household and their smoking
habits.


Bayesian inference for county radon levels 


The previous discussion concerns decision making under certainty. Individual homeowners
are likely to have limited information about the radon exposure level for their home. A goal
of some researchers has been to identify locations and predictive variables associated with
high-radon homes so that monitoring and remediation programs can be focused efficiently.
Two datasets are readily available for such a study:


• Long-term measurements from approximately 5000 houses, selected as a cluster sample
from 125 randomly selected counties.

• Short-term measurements from about 80,000 houses, sampled at random from all the
counties in the U.S.

This is a pattern we sometimes see: a relatively small amount of accurate data, along with
a large amount of biased and imprecise data. The challenge is to use the good data to
calibrate the bad data, so that inference can be made about the entire country, not merely
the 125 counties in the sample of long-term measurements.


Hierarchical model. We simultaneously calibrate the data and predict radon levels by fitting
a hierarchical model to both sets of measurements, using predictors at the house and
county level, with a separate model fit to each of the 10 regions of the U.S. Let \(y_i\) denote the
logarithm of the radon measurement of house \(i\) within county \(j(i)\) and \(X\) denote a matrix
of household-level predictors including indicators for whether the house has a basement and
whether the basement is a living area, along with an indicator variable that equals 1 if
measurement \(i\) is a short-term screening measurement. Including the indicator corrects for
the biases in the screening measurements. We assume a normal linear regression model,

\[ y_i \sim \mathrm{N}(X_i\beta + \alpha_{j(i)}, \sigma_i^2), \quad \text{for houses } i = 1, \ldots, n, \]

where \(\alpha_{j(i)}\) is a county effect, and the data-level variance parameter \(\sigma_i\) can take on two
possible values depending on whether measurement \(i\) is long- or short-term.

The county parameters \(\alpha_j\) are also assumed normally distributed,

\[ \alpha_j \sim \mathrm{N}(W_j\gamma + \delta_{k(j)}, \tau^2), \quad \text{for counties } j = 1, \ldots, J, \]

with county-level predictors \(W\) including climate data and a measure of the uranium level
in the soil, and the indicator \(\delta_{k(j)}\) characterizing the geology of county \(j\) as one of \(K = 19\)
types. Finally, the coefficients \(\delta_k\) for the 19 geological types are themselves estimated from
the data,

\[ \delta_k \sim \mathrm{N}(0, \kappa^2), \quad \text{for geologic types } k = 1, \ldots, K, \]

as are the hierarchical variance components \(\tau\) and \(\kappa\). Finally, we divide the country into
ten regions and fit the model separately within each region.

Combining long- and short-term measurements allows us to estimate the distribution 
of radon levels in nearly every county in the U.S., albeit with widely varying uncertainties 
depending primarily on the number of houses in the sample within the county. 
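
To make the three levels of the model concrete, the following sketch simulates fake data from it, drawing geology effects, then county effects, then house measurements. It does not fit the model. All sizes, coefficient values, and the single county-level predictor are hypothetical choices made for illustration; only the two measurement standard deviations, log(1.2) and log(1.8), are taken from the geometric standard deviations quoted earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and parameter values, for illustration only.
J, K, n = 50, 19, 1000                 # counties, geology types, houses
beta = np.array([0.5, 0.3, 0.4])       # basement, basement living area, short-term indicator
gamma = np.array([0.2])                # one county-level predictor (e.g. soil uranium)
tau, kappa = 0.3, 0.2                  # hierarchical standard deviations
sigma_long, sigma_short = np.log(1.2), np.log(1.8)   # measurement sds on the log scale

# Geology level: delta_k ~ N(0, kappa^2)
delta = rng.normal(0.0, kappa, size=K)

# County level: alpha_j ~ N(W_j gamma + delta_{k(j)}, tau^2)
W = rng.normal(size=(J, 1))
geology = rng.integers(0, K, size=J)   # k(j): geology type of county j
alpha = rng.normal(W @ gamma + delta[geology], tau)

# House level: y_i ~ N(X_i beta + alpha_{j(i)}, sigma_i^2)
county = rng.integers(0, J, size=n)    # j(i): county of house i
basement = rng.integers(0, 2, size=n)
living = basement * rng.integers(0, 2, size=n)   # living-area basement requires a basement
short_term = rng.integers(0, 2, size=n)
X = np.column_stack([basement, living, short_term])
sigma_i = np.where(short_term == 1, sigma_short, sigma_long)
y = rng.normal(X @ beta + alpha[county], sigma_i)   # log radon measurements

print(y[:5])
```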


Inferences. Unfortunately (from the standpoint of radon mitigation), indoor radon con- 
centrations are highly variable even within small areas. Given the predictors included in 
the model, the radon level of an individual house in a specified county can be predicted 
only to within a factor of at best about 1.9 (that is to say, the posterior geometric standard 
deviation is about 1.9), with a factor of 2.3 being more typical, a disappointingly large 
predictive uncertainty considering the factor of 3.1 that would hold given no information on 
the home other than that it is in the U.S. On the other hand, this seemingly modest reduc- 
tion in uncertainty is still enough to identify some areas where high-radon homes are very 
rare or very common. For instance, in the mid-Atlantic states, more than half the houses 
in some counties have long-term living area concentrations over the EPA’s recommended 
action level of 4 pCi/L, whereas in other counties fewer than one-half of one percent exceed 
that level. 


Bayesian inference for the radon level in an individual house 


We use the fitted hierarchical regression model to perform inferences and decision analyses
for previously unmeasured houses \(i\), using the following notation:


\(R_i\) = radon concentration in house \(i\)

\(\theta_i = \log(R_i)\)


For the decision on house \(i\), we need the posterior predictive distribution for a given
\(\theta_i\), averaging over the posterior uncertainties in regression coefficients, county effects, and
variance components; it will be approximately normal (because the variance components
are so well estimated), and we label it as

\[ \theta_i \sim \mathrm{N}(M_i, S_i^2), \tag{9.5} \]

where \(M_i\) and \(S_i\) are computed from the posterior simulations of the model estimation. The
mean is \(M_i = X_i\hat{\beta} + \hat{\alpha}_{j(i)}\), where \(X_i\) is a row vector containing the house-level predictors
(indicators for whether the house has a basement and whether the basement is a living
area) for house \(i\), and \((\hat{\beta}, \hat{\alpha})\) are the posterior means from the analysis in the appropriate
region of the country. The variance \(S_i^2\) (obtained from the earlier posterior computation)
includes the posterior uncertainty in the coefficients \(\alpha, \beta\) and also the hierarchical variance
components \(\tau^2\) and \(\kappa^2\). (We are predicting actual radon levels, not measurements, and so
\(\sigma^2\) does not play a role here.) It turns out that the geometric standard deviations \(e^S\) of the
predictive distributions for home radon levels vary from 2.1 to 3.0, and they are in the range
(2.1, 2.5) for most U.S. houses. (The houses with \(e^S > 2.5\) lie in small-population counties
for which little information was available in the radon surveys, resulting in relatively high
predictive uncertainty within these counties.) The geometric means of the house predictive
distributions, \(e^M\), vary from 0.1 to 14.6 pCi/L, with 95% in the range [0.3, 3.7] and 50% in
the range [0.6, 1.6]. The houses with the highest predictive geometric means are houses with
basement living areas in high-radon counties; the houses with lowest predictive geometric
means have no basements and lie in low-radon counties.

The distribution (9.5) summarizes the state of knowledge about the radon level in a
house given only its county and basement information. In this respect it serves as a prior
distribution for the homeowner. Now suppose a measurement \(y \sim \mathrm{N}(\theta, \sigma^2)\) is taken in the
house. (We are assuming an unbiased measurement. If a short-term measurement is being
used, it will have to be corrected for the biases which were estimated in the regression
models.) In our notation, \(y\) and \(\theta\) are the logarithms of the measurement and the true
home radon level, respectively. The posterior distribution for \(\theta\) is

\[ \theta \mid M, y \sim \mathrm{N}(\Lambda, V), \tag{9.6} \]

where

\[ \Lambda = \frac{\frac{M}{S^2} + \frac{y}{\sigma^2}}{\frac{1}{S^2} + \frac{1}{\sigma^2}}, \qquad V = \frac{1}{\frac{1}{S^2} + \frac{1}{\sigma^2}}. \tag{9.7} \]


We base our decision analysis of when to measure and when to remediate on the distributions
(9.5) and (9.6).
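
The update from the prior (9.5) to the posterior (9.6) is the standard precision-weighted combination of a normal prior and a normal measurement. A minimal sketch follows; the function name and the example house (prior geometric mean 1.0 pCi/L, prior geometric standard deviation 2.3, and a long-term measurement of 4 pCi/L) are hypothetical values chosen for illustration.

```python
import math

def posterior(M, S, y, sigma):
    """Posterior mean and variance of theta = log(radon), as in (9.6)-(9.7).

    M, S  : prior mean and sd of theta
    y     : log of the (bias-corrected) measurement
    sigma : measurement sd on the log scale
    """
    prec_prior = 1.0 / S ** 2
    prec_data = 1.0 / sigma ** 2
    V = 1.0 / (prec_prior + prec_data)
    Lam = V * (M * prec_prior + y * prec_data)
    return Lam, V

# Hypothetical house, for illustration only.
Lam, V = posterior(M=math.log(1.0), S=math.log(2.3), y=math.log(4.0), sigma=math.log(1.2))
print(f"posterior geometric mean: {math.exp(Lam):.2f} pCi/L")
print(f"posterior geometric sd:   {math.exp(math.sqrt(V)):.2f}")
```

Because the long-term measurement is much more precise than the prior, the posterior geometric mean sits close to the measured value and the posterior geometric standard deviation is slightly below the measurement's 1.2.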


Decision analysis for individual homeowners 


We now work out the optimal decisions of measurement and remediation conditional on the 
predicted radon level in a home, the additional risk of lung cancer death from radon, the 
effects of remediation, and individual attitude toward risk. 

Given an action level under certainty, \(R_{\text{action}}\), we address the question of whether to
pay for a home radon measurement and whether to remediate. The decision of whether to 
measure depends on the prior distribution (9.5) of radon level for your house, given your 
predictors X. We use the term ‘prior distribution’ to refer to the predictive distribution 
based on our hierarchical model; the predictive distribution conditions on the survey data 
but is prior to any specific measurements for the house being considered. The decision 
of whether to remediate depends on the posterior distribution (9.6) if a measurement has 
been taken or the prior distribution (9.5) otherwise. In our computations, we use the 
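following results from the normal distribution: if \(z \sim \mathrm{N}(\mu, s^2)\), then \(\mathrm{E}(e^z) = e^{\mu + \frac{1}{2}s^2}\) and
\(\mathrm{E}(e^z \mid z > a)\Pr(z > a) = e^{\mu + \frac{1}{2}s^2}\left(1 - \Phi\left(\frac{a - \mu - s^2}{s}\right)\right)\), where \(\Phi\) is the standard normal cumulative
distribution function. By complement, \(\mathrm{E}(e^z \mid z < a)\Pr(z < a) = e^{\mu + \frac{1}{2}s^2}\,\Phi\left(\frac{a - \mu - s^2}{s}\right)\).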
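
These lognormal identities are easy to get wrong by a sign, so a numerical check is worthwhile. The sketch below compares the closed form for the upper partial expectation with a midpoint-rule integral; the function names, the test values of mu, s, and a, and the integration settings are illustrative assumptions.

```python
import math

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def upper_partial_expectation(mu, s, a):
    """Closed form for E(e^z | z > a) Pr(z > a) when z ~ N(mu, s^2)."""
    return math.exp(mu + 0.5 * s ** 2) * (1.0 - Phi((a - mu - s ** 2) / s))

def lower_partial_expectation(mu, s, a):
    """Closed form for E(e^z | z < a) Pr(z < a) when z ~ N(mu, s^2)."""
    return math.exp(mu + 0.5 * s ** 2) * Phi((a - mu - s ** 2) / s)

def upper_partial_expectation_numeric(mu, s, a, width=12.0, steps=100_000):
    """Midpoint-rule integral of e^z times the N(mu, s^2) density over (a, mu + width*s)."""
    hi = mu + width * s
    h = (hi - a) / steps
    total = 0.0
    for k in range(steps):
        z = a + (k + 0.5) * h
        density = math.exp(-0.5 * ((z - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        total += math.exp(z) * density * h
    return total

mu, s, a = 0.5, 0.8, 1.0   # arbitrary test values
closed = upper_partial_expectation(mu, s, a)
numeric = upper_partial_expectation_numeric(mu, s, a)
print(closed, numeric)
```

The upper and lower pieces sum to the full mean, which gives a second check that does not depend on numerical integration.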
The decision tree is set up with three branches. In each branch, we evaluate the expected
loss in dollar terms, converting radon exposure (over a 30-year period) to dollars using
\(D_r = \$2000/(R_{\text{action}} - R_{\text{remed}})\) as the equivalent cost per pCi/L for additional home radon
exposure. In the expressions below we let \(R = e^{\theta}\) be the unknown radon exposure level in
the home being considered; the prior and posterior distributions are normal distributions
for \(\theta = \log R\).

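1. Remediate without monitoring. Expected loss is remediation cost + equivalent
dollar cost of radon exposure after remediation:

\[
\begin{aligned}
L_1 &= \$2000 + D_r\,\mathrm{E}(\min(R, R_{\text{remed}})) \\
    &= \$2000 + D_r\left[R_{\text{remed}}\Pr(R \ge R_{\text{remed}}) + \mathrm{E}(R \mid R < R_{\text{remed}})\Pr(R < R_{\text{remed}})\right] \\
    &= \$2000 + D_r\left[R_{\text{remed}}\,\Phi\!\left(\frac{M - \log(R_{\text{remed}})}{S}\right) + e^{M + \frac{1}{2}S^2}\left(1 - \Phi\!\left(\frac{M + S^2 - \log(R_{\text{remed}})}{S}\right)\right)\right].
\end{aligned}
\tag{9.8}
\]

2. Do not monitor or remediate. Expected loss is the equivalent dollar cost of radon
exposure:

\[ L_2 = D_r\,\mathrm{E}(R) = D_r\,e^{M + \frac{1}{2}S^2}. \tag{9.9} \]

3. Take a measurement \(y\) (measured in log pCi/L). The immediate loss is the measurement
cost (assumed to be $50) and, in addition, the radon exposure during the year
that you are taking the measurement (which is \(\frac{1}{30}\) of the 30-year exposure (9.9)). The
inner decision has two branches:

(a) Remediate. Expected loss is the immediate loss due to measurement plus the remediation
loss which is computed as for decision 1, but using the posterior rather than
the prior distribution:

\[
L_{3a} = \$50 + D_r\,\frac{1}{30}\,e^{M + \frac{1}{2}S^2} + \$2000 + D_r\left[R_{\text{remed}}\,\Phi\!\left(\frac{\Lambda - \log(R_{\text{remed}})}{\sqrt{V}}\right) + e^{\Lambda + \frac{1}{2}V}\left(1 - \Phi\!\left(\frac{\Lambda + V - \log(R_{\text{remed}})}{\sqrt{V}}\right)\right)\right],
\tag{9.10}
\]

where \(\Lambda\) and \(V\) are the posterior mean and variance from (9.7).

(b) Do not remediate. Expected loss is:

\[ L_{3b} = \$50 + D_r\,\frac{1}{30}\,e^{M + \frac{1}{2}S^2} + D_r\,e^{\Lambda + \frac{1}{2}V}. \tag{9.11} \]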

Deciding whether to remediate given a measurement. To evaluate the decision tree, we must
first consider the inner decision between 3(a) and 3(b), conditional on the measurement \(y\).
Let \(y_0\) be the point (on the logarithmic scale) at which you will choose to remediate if
\(y > y_0\), or do nothing if \(y < y_0\). (Because of measurement error, \(y \ne \theta\), and consequently
\(y_0 \ne \log(R_{\text{action}})\).) We determine \(y_0\), which depends on the prior mean \(M\), the prior
standard deviation \(S\), and the measurement standard deviation \(\sigma\), by numerically solving
the implicit equation

\[ L_{3a} = L_{3b} \quad \text{at } y = y_0. \tag{9.12} \]


Details of our approach for solving the equation are not provided here. 
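
One simple way to solve the implicit equation (9.12) is bisection on the difference between the two inner-branch losses, which decreases as the measurement increases. This is a sketch under that assumption and is not necessarily the approach used for the results reported here; the function names, the search interval, and the example prior (geometric mean 1.0 pCi/L) are hypothetical.

```python
import math

COST_REMEDIATE = 2000.0   # dollars (from the text)
R_REMED = 2.0             # post-remediation level in pCi/L (from the text)

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def expected_capped_radon(m, v):
    """E[min(R, R_remed)] when log R ~ N(m, v): the bracketed term in (9.8) and (9.10)."""
    sd, log_r = math.sqrt(v), math.log(R_REMED)
    return (R_REMED * Phi((m - log_r) / sd)
            + math.exp(m + 0.5 * v) * (1.0 - Phi((m + v - log_r) / sd)))

def posterior(M, S, y, sigma):
    """Posterior mean and variance from (9.7)."""
    V = 1.0 / (1.0 / S ** 2 + 1.0 / sigma ** 2)
    return V * (M / S ** 2 + y / sigma ** 2), V

def loss_gap(y, M, S, sigma, D_r):
    """L_3a - L_3b given measurement y. The $50 and the one-year exposure term
    appear in both (9.10) and (9.11), so they cancel in the difference."""
    Lam, V = posterior(M, S, y, sigma)
    remediate = COST_REMEDIATE + D_r * expected_capped_radon(Lam, V)
    do_nothing = D_r * math.exp(Lam + 0.5 * V)
    return remediate - do_nothing

def remediation_threshold(M, S, sigma, D_r, lo=-10.0, hi=10.0, tol=1e-10):
    """Bisection for y_0 in (9.12); assumes the gap changes sign on [lo, hi]."""
    if not (loss_gap(lo, M, S, sigma, D_r) > 0.0 > loss_gap(hi, M, S, sigma, D_r)):
        raise ValueError("no sign change on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if loss_gap(mid, M, S, sigma, D_r) > 0.0:
            lo = mid   # remediation still costs more: threshold is higher
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical example: prior GM 1.0 pCi/L, prior GSD 2.3, long-term measurement, R_action = 4.
M, S, sigma = math.log(1.0), math.log(2.3), math.log(1.2)
D_r = COST_REMEDIATE / (4.0 - R_REMED)
y0 = remediation_threshold(M, S, sigma, D_r)
print(f"remediate if the measurement exceeds about {math.exp(y0):.2f} pCi/L")
```

In this example the threshold on the measurement comes out a little above 4 pCi/L, because the posterior mean is pulled from the measurement toward the lower prior mean.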


Deciding among the three branches. The expected loss for immediate remediation (9.8)
and the expected loss for no action (9.9) can be determined directly for a given prior mean
\(M\), prior standard deviation \(S\), and specified dollar value \(D_r\) for radon reduction. We
determine the expected loss for branch 3 of the decision tree,

\[ L_3 = \mathrm{E}(\min(L_{3a}, L_{3b})), \tag{9.13} \]

by averaging over the prior uncertainty in the measurement \(y\) (given a value for the
measurement variability \(\sigma\)) as follows.




[Figure 9.5: plot with the prior geometric mean, exp(M), on the vertical axis and the perfect-information
action level, R_action (pCi/L), on the horizontal axis, divided from top to bottom into regions labeled
‘Remediate without monitoring,’ ‘Take a measurement,’ and ‘Do not monitor or remediate.’]

Figure 9.5 Recommended radon remediation/measurement decision as a function of the
perfect-information action level \(R_{\text{action}}\) and the prior geometric mean radon level \(e^M\), under the simplifying
assumption that \(e^S = 2.3\). You can read off your recommended decision from this graph and, if
the recommendation is ‘take a measurement,’ you can do so and then perform the calculations to
determine whether to remediate, given your measurement. The horizontal axis of this figure begins
at 2 pCi/L because remediation is assumed to reduce home radon level to 2 pCi/L, so it makes no
sense for \(R_{\text{action}}\) to be lower than that value. Wiggles in the lines are due to simulation variability.


1. Simulate 5000 draws of \(y \sim \mathrm{N}(M, S^2 + \sigma^2)\).
2. For each draw of \(y\), compute \(\min(L_{3a}, L_{3b})\) from (9.10) and (9.11).
3. Estimate \(L_3\) as the average of these 5000 values.


This expected loss is valid only if we assume that you will make the recommended optimal 
decision once the measurement is taken. 
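
The three simulation steps above, together with the closed-form losses (9.8) and (9.9), can be sketched as follows. This is an illustrative implementation under the stated model, not the code used to produce the figures; the function names, the random seed, and the three example prior geometric means are assumptions.

```python
import math
import random

COST_REMEDIATE, COST_MEASURE, R_REMED = 2000.0, 50.0, 2.0   # from the text

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def expected_capped_radon(m, v):
    """E[min(R, R_remed)] when log R ~ N(m, v)."""
    sd, log_r = math.sqrt(v), math.log(R_REMED)
    return (R_REMED * Phi((m - log_r) / sd)
            + math.exp(m + 0.5 * v) * (1.0 - Phi((m + v - log_r) / sd)))

def expected_losses(M, S, sigma, r_action, n_draws=5000, seed=1):
    """Return (L1, L2, L3); L3 is a Monte Carlo estimate of (9.13)."""
    D_r = COST_REMEDIATE / (r_action - R_REMED)
    exposure = D_r * math.exp(M + 0.5 * S ** 2)                     # (9.9)
    L1 = COST_REMEDIATE + D_r * expected_capped_radon(M, S ** 2)    # (9.8)
    L2 = exposure
    V = 1.0 / (1.0 / S ** 2 + 1.0 / sigma ** 2)                     # (9.7)
    upfront = COST_MEASURE + exposure / 30.0    # measurement cost plus one year of exposure
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        y = rng.gauss(M, math.sqrt(S ** 2 + sigma ** 2))            # step 1
        Lam = V * (M / S ** 2 + y / sigma ** 2)
        L3a = upfront + COST_REMEDIATE + D_r * expected_capped_radon(Lam, V)   # (9.10)
        L3b = upfront + D_r * math.exp(Lam + 0.5 * V)                          # (9.11)
        total += min(L3a, L3b)                                      # step 2
    return L1, L2, total / n_draws                                  # step 3

DECISIONS = ("remediate without monitoring",
             "do not monitor or remediate",
             "take a measurement")

def recommend(M, S, sigma, r_action):
    """Pick the branch with the lowest expected loss."""
    losses = expected_losses(M, S, sigma, r_action)
    return DECISIONS[losses.index(min(losses))]

# Hypothetical prior geometric means, with prior GSD 2.3 and a long-term measurement.
for gm in (0.3, 2.0, 10.0):
    print(gm, "->", recommend(math.log(gm), math.log(2.3), math.log(1.2), 4.0))
```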


We can now compare the expected losses \(L_1, L_2, L_3\), and choose among the three decisions.
The recommended decision is the one with the lowest expected loss. An individual
homeowner can apply this approach simply by specifying \(R_{\text{action}}\) (the decision threshold under
certainty), looking up the prior mean and standard deviation for the home’s radon level
as estimated by the hierarchical model, and determining the optimal decision. In addition,
our approach makes it possible for a homeowner to take into account any additional information
that is available. For example, if a measurement is available for a neighbor’s house, then
one can update the prior mean and standard deviation to include this information.

If we are willing to make the simplifying assumption that \(\sigma = \log(1.2)\) and \(S = \log(2.3)\)
for all counties, then we can summarize the decision recommendations by giving threshold
levels \(M_{\text{low}}\) and \(M_{\text{high}}\) for which decision 1 (remediate immediately) is preferred if \(M > M_{\text{high}}\),
decision 2 (do not monitor or remediate) is preferred if \(M < M_{\text{low}}\), and decision 3
(take a measurement) is preferred if \(M \in [M_{\text{low}}, M_{\text{high}}]\). Figure 9.5 displays these cutoffs
as a function of \(R_{\text{action}}\), and thus displays the recommended decision as a function of
\((R_{\text{action}}, e^M)\). For example, setting \(R_{\text{action}} = 4\) pCi/L leads to the following recommendation
based on \(e^M\), the prior GM of your home radon based on your county and house type:


• If \(e^M\) is less than 1.0 pCi/L (which corresponds to 68% of U.S. houses), do nothing.


• If \(e^M\) is between 1.0 and 3.5 pCi/L (27% of U.S. houses), perform a long-term measurement
(and then decide whether to remediate).


• If \(e^M\) is greater than 3.5 pCi/L (5% of U.S. houses), remediate immediately without
measuring. Actually, in this circumstance, short-term monitoring can turn out to be
(barely) cost-effective if we include it as an option. We ignore this additional complexity
to the decision tree, since it occurs rarely and has little impact on the overall cost-benefit
analysis.

Figure 9.6 Maps showing (a) fraction of houses in each county for which measurement is recommended,
given the perfect-information action level of \(R_{\text{action}} = 4\) pCi/L; (b) expected fraction of
houses in each county for which remediation will be recommended, once the measurement y has 
been taken. For the present radon model, within any county the recommendations on whether to 
measure and whether to remediate depend only on the house type: whether the house has a basement 
and whether the basement is used as living space. Apparent discontinuities across the boundaries 
of Utah and South Carolina arise from irregularities in the radon measurements from the radon 
surveys conducted by those states, an issue we ignore here. 


Aggregate consequences of individual decisions 


Now that we have made idealized recommendations for individual homeowners, we consider 
the aggregate effects if the recommendations are followed by all homeowners in the U.S. In 
particular, we compare the consequences of individuals following our recommendations to 
the consequences of other policies, such as that implicitly recommended by the EPA, of taking
a short-term measurement as a condition of a home sale and performing remediation if the 
measurement exceeds 4 pCi/L. 


Applying the recommended decision strategy to the entire country. Figure 9.6 displays the 
geographic pattern of recommended measurements (and, after one year, recommended remediations),
based on an action level \(R_{\text{action}}\) of 4 pCi/L. Each county is shaded according to
the proportion of houses for which measurement (and then remediation) is recommended. 
These recommendations incorporate the effects of parameter uncertainties in the models 
that predict radon distributions within counties, so these maps would be expected to change 
somewhat as better predictions become available. 

From a policy standpoint, perhaps the most significant feature of the maps is that 
even if the EPA’s recommended action level of 4 pCi/L is assumed to be correct—and, as 
we have discussed, it does lead to a reasonable value of \(D_d\), under standard dose-response
assumptions—monitoring is still not recommended in most U.S. homes. Indeed, only 28% of 
U.S. homes would perform radon monitoring. A higher action level of 8 pCi/L, a reasonable 
value for nonsmokers under the standard assumptions, would lead to even more restricted 
monitoring and remediation: only about 5% of homes would perform monitoring. 


Evaluation of different decision strategies. We estimate the total monetary cost and lives 
saved if each of the following decision strategies were to be applied nationally: 


1. The recommended strategy from the decision analysis (that is, monitor homes with prior
mean estimates above a given level, and remediate those with high measurements).

2. Performing long-term measurements on all houses and then remediating those for which
the measurement exceeds the specified radon action level \(R_{\text{action}}\).

3. Performing short-term measurements on all houses and then remediating those for which
the bias-corrected measurement exceeds the specified radon action level \(R_{\text{action}}\) (with the
bias estimated from the hierarchical regression model).

4. Performing short-term measurements on all houses and then remediating those for which
the uncorrected measurement exceeds the specified radon action level \(R_{\text{action}}\).


Figure 9.7 Expected lives saved vs. expected cost for various radon measurement/remediation
strategies. Numbers indicate values of \(R_{\text{action}}\). The solid line is for the recommended strategy of
measuring only certain homes; the others assume that all homes are measured. All results are estimated
totals for the U.S. over a 30-year period.


We evaluate each of the above strategies in terms of aggregate lives saved and dollar
cost, with these outcomes parameterized by the radon action level \(R_{\text{action}}\). Both lives
saved and costs are considered for a 30-year period. For each strategy, we assume that
the level \(R_{\text{action}}\) is the same for all houses (this would correspond to a uniform national
recommendation). To compute the lives saved, we assume that the household composition 
for each house is the same as the average in the U.S. We evaluate the expected cost and 
the expected number of lives saved by aggregating over the decisions for individual homes 
in the country. In practice for our model we need only consider three house types defined 
by our predictors (no basement, basement is not living space, basement is living space) for 
each of the 3078 counties. 

We describe the results of our expected cost and expected lives saved calculation in 
some detail only for the decision strategy based on our hierarchical model. If the strategy 
were followed everywhere with \(R_{\text{action}} = 4\) pCi/L (as pictured in the maps in Figure 9.6),
about 26% of the 70 million ground-contact houses in the U.S. would monitor and about 
4.5% would remediate. The houses being remediated include 2.8 million homes with radon 
levels above 4 pCi/L (74% of all such homes), and 840,000 of the homes above 8 pCi/L 
(91% of all such homes). The total monetary cost is estimated at $7.3 billion—$1 billion 
for measurement and $6.3 billion for remediation—and would be expected to save the lives 
of 49,000 smokers and 35,000 nonsmokers over a 30-year period. Total cost and total lives 
saved for other action levels and other decision strategies are calculated in the same way. 

Figure 9.7 displays the tradeoff between expected cost and expected lives saved over 
a thirty-year period for the four strategies. The numbers on the curves are action levels,
\(R_{\text{action}}\). This figure allows us to compare the effectiveness of alternative strategies of equal
expected cost or equal expected lives saved. For example, the recommended strategy (the
solid line on the graph) at \(R_{\text{action}} = 4\) pCi/L would result in an expected 83,000 lives saved
at an expected cost of $7.3 billion. Let us compare this to the EPA’s implicitly recommended
strategy based on uncorrected short-term measurements (the dashed line on the figure). For
the same cost of $7.3 billion, the uncorrected short-term strategy is expected to save only
64,000 lives; to achieve the same expected savings of 83,000 lives, the uncorrected short-term
strategy would cost about $12 billion.
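
The aggregate figures above can be cross-checked with a few lines of arithmetic. The sketch below recomputes the measurement and remediation totals from the quoted percentages and then derives a cost per expected life saved for each comparison. The cost-per-life ratio is a derived summary that is not reported in the text, and all variable names are illustrative.

```python
HOUSES = 70e6                 # ground-contact houses in the U.S. (from the text)
COST_MEASURE = 50.0           # long-term measurement, dollars (from the text)
COST_REMEDIATE = 2000.0       # remediation, dollars (from the text)

# Recommended strategy at R_action = 4 pCi/L: 26% monitor, 4.5% remediate.
monitor_cost = 0.26 * HOUSES * COST_MEASURE
remediate_cost = 0.045 * HOUSES * COST_REMEDIATE
print(f"measurement: ${monitor_cost / 1e9:.2f} billion")
print(f"remediation: ${remediate_cost / 1e9:.2f} billion")

# Derived ratios (dollars per expected life saved); not stated in the text.
recommended = 7.3e9 / 83_000          # recommended strategy
short_same_cost = 7.3e9 / 64_000      # uncorrected short-term, same budget
short_same_lives = 12e9 / 83_000      # uncorrected short-term, same lives saved
for label, value in [("recommended", recommended),
                     ("short-term, same cost", short_same_cost),
                     ("short-term, same lives", short_same_lives)]:
    print(f"{label}: about ${value:,.0f} per life saved")
```

The recomputed totals agree with the quoted $1 billion and $6.3 billion after rounding.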


9.5 Personal vs. institutional decision analysis 


Statistical inference has an ambiguous role in decision making. Under a ‘subjective’ view of 
probability (which we do not generally find useful; see Sections 1.5-1.7), posterior inferences 
represent the personal beliefs of the analyst, given his or her prior information and data. 
These can then be combined with a subjective utility function and input into a decision tree 
to determine the optimal decision, or sequence of decisions, so as to maximize subjective 
expected utility. This approach has serious drawbacks as a procedure for personal decision 
making, however. It can be more difficult to define a utility function and subjective prob- 
abilities than to simply choose the most appealing decision. The formal decision-making 
procedure has an element of circular reasoning, in that one can typically come to any desired 
decision by appropriately setting the subjective inputs to the analysis. 

In practice, then, personal decision analysis is most useful when the inputs (utilities and 
probabilities) are well defined. For example, in the cancer screening example discussed in 
Section 9.3, the utility function is noncontroversial—years of life, with a slight adjustment 
for quality of life—and the relevant probabilities are estimated from the medical literature. 
Bayesian decision analysis then serves as a mathematical tool for calculating the expected 
value of the information that would come from the screening. 

In institutional settings such as businesses, governments, or research organizations, de- 
cisions need to be justified, and formal decision analysis has a role to play in clarifying the 
relation between the assumptions required to build and apply a relevant probability model 
and the resulting estimates of costs and benefits. We introduce the term institutional deci- 
sion analysis to refer to the process of transparently setting up a probability model, utility 
function, and an inferential framework leading to cost estimates and decision recommen- 
dations. Depending on the institutional setting, the decision analysis can be formalized to 
different extents. For example, the meta-analysis in Section 9.2 leads to fairly open-ended 
recommendations about incentives for sample surveys—given the high levels of posterior 
uncertainties, it would not make sense to give a single recommendation, since it would be 
so sensitive to the assumptions about the relative utility of dollars and response rate. For 
the cancer-screening example in Section 9.3, the decision analysis is potentially useful both 
for its direct recommendation (not to perform bronchoscopy for this sort of patient) and 
also because it can be taken apart to reveal the sensitivity of the conclusion to the different 
assumptions taken from the medical literature on probabilities and expected years of life. 

In contrast, the key assumptions in the hierarchical decision analysis for radon exposure
in Section 9.4 have to do with cost-benefit tradeoffs. By making a particular
assumption about the relative importance of dollars and cancer risk (corresponding to
\(R_{\text{action}} = 4\) pCi/L), we can make specific recommendations by county (see the maps in
Figure 9.6). It would be silly to believe that all households in the United
States have utility functions equivalent to this constant level of \(R_{\text{action}}\), but the analysis
resulting in the maps is useful to give a sense of a uniform recommendation that could be
made by the government.

In general, there are many ways in which statistical inferences can be used to inform 
decision-making. The essence of the ‘objective’ or ‘institutional’ Bayesian approach is to 
clearly identify the model assumptions and data used to form the inferences, evaluate the 
reasonableness and the fit of the model’s predictions (which include decision recommenda- 
tions as a special case), and then expand the model as appropriate to be more realistic. 
The most useful model expansions are typically those that allow more information to be 
incorporated into the inferences. 


9.6 Bibliographic note


Berger (1985) and DeGroot (1970) both give clear presentations of the theoretical issues 
in decision theory and the connection to Bayesian inference. Many introductory books 
have been written on the topic; Luce and Raiffa (1957) is particularly interesting for its 
wide-ranging discussions. Savage (1954) is an influential early work that justifies Bayesian 
statistical methods in terms of decision theory. 

Gneiting (2011) reviews scoring functions for point predictions and also presents survey 
results of use of scoring functions in the evaluation of point forecasts in businesses and 
organizations. Gneiting and Raftery (2007) review scoring rules for probabilistic prediction. 
Vehtari and Ojanen (2012) provide a detailed decision-theoretic review of Bayesian predictive 
model assessment, selection, and comparison methods. 

Clemen (1996) provides a thorough introduction to applied decision analysis. Parmigiani 
(2002) is a textbook on medical decision making from a Bayesian perspective. The articles in 
Kahneman, Slovic, and Tversky (1982) and Gilovich, Griffin, and Kahneman (2002) address 
many of the component problems in decision analysis from a psychological perspective. 

The decision analysis for incentives in telephone surveys appears in Gelman, Stevens, 
and Chan (2003). The meta-analysis data were collected by Singer et al. (1999), and Groves 
(1989) discusses many practical issues in sampling, including the effects of incentives in mail 
surveys. More generally, Dehejia (2005) discusses the connection between decision analysis 
and causal inference in models with interactions. 

Parmigiani (2004) discusses the value of information in medical diagnostics. Parmigiani 
et al. (1999) and the accompanying discussions include several perspectives on Bayesian 
inference in medical decision making for breast cancer screening. The cancer screening 
example in Section 9.3 is adapted and simplified from Moroff and Pauker (1983). The 
journal Medical Decision Making, where this article appears, contains many interesting 
examples and discussions of applied decision analysis. Heitjan, Moskowitz, and Whang 
(1999) discuss Bayesian inference for cost-effectiveness of medical treatments. Fouskakis 
and Draper (2008) and Fouskakis, Ntzoufras, and Draper (2009) discuss a model selection 
example in which monetary utility is assigned to the data collection costs as well as to the 
accuracy of predicting the mortality rate in a health policy problem. Lau, Ioannidis, and 
Schmid (1997) present a general review of meta-analysis in medical decision making. 

The radon problem is described by Lin et al. (1999). Boscardin and Gelman (1996) 
describe the computations for the hierarchical model for the radon example in more detail. 
Ford et al. (1999) present a cost-benefit analysis of the radon problem without using a 
hierarchical model. 


9.7 Exercises 


1. Basic decision analysis: Widgets cost $2 each to manufacture and you can sell them for 
$3. Your forecast for the market for widgets is (approximately) normally distributed with 
mean 10,000 and standard deviation 5,000. How many widgets should you manufacture 
in order to maximize your expected net profit? 


2. Conditional probability and elementary decision theory: Oscar has lost his dog; there is 
a 70% probability it is in forest A and a 30% chance it is in forest B. If the dog is in 
forest A and Oscar looks there for a day, he has a 50% chance of finding the dog. If the 
dog is in forest B and Oscar looks there for a day, he has an 80% chance of finding the 
dog. 

(a) If Oscar can search only one forest for a day, where should he look to maximize his 
probability of finding the dog? What is the probability that the dog is still lost after 
the search? 

(b) Assume Oscar made the rational decision and the dog is still lost (and is still in the 
same forest as yesterday). Where should he search for the dog on the second day? 
What is the probability that the dog is still lost at the end of the second day? 

(c) Again assume Oscar makes the rational decision on the second day and the dog is still 
lost (and is still in the same forest). Where should he search on the third day? What 
is the probability that the dog is still lost at the end of the third day? 

(d) (Expected value of additional information.) You will now figure out the expected value 
of knowing, at the beginning, which forest the dog is in. Suppose Oscar will search 
for at most three days, with the following payoffs: −1 if the dog is found in one day, 
−2 if the dog is found on the second day, −3 if the dog is found on the third day, 
and −10 otherwise. 


i. What is Oscar’s expected payoff without the additional information? 
ii. What is Oscar’s expected payoff if he knows the dog is in forest A? 
iii. What is Oscar’s expected payoff if he knows the dog is in forest B? 
iv. Before the search begins, how much should Oscar be willing to pay to be told which 
forest his dog is in? 


3. Decision analysis: 


(a) Formulate an example from earlier in this book as a decision problem. (For example, in 
the bioassay example of Section 3.7, there can be a cost of setting up a new experiment, 
a cost per rat in the experiment, and a benefit to estimating the dose-response curve 
more accurately. Similarly, in the meta-analysis example in Section 5.6, there can be a 
cost per study, a cost per patient in the study, and a benefit to accurately estimating 
the efficacy of beta-blockers.) 

(b) Set up a utility function and determine the expected utility for each decision option 
within the framework you have set up. 

(c) Explore the sensitivity of the results of your decision analysis to the assumptions you 
have made in setting up the decision problem. 
Part III: Advanced Computation 


The remainder of this book delves into more sophisticated models. Before we begin this 
enterprise, however, we detour to describe methods for computing posterior distributions 
in hierarchical models. Toward the end of Chapter 5, the algebra required for analytic 
derivation of posterior distributions became less and less attractive, and that was with a 
relatively simple model constructed entirely from normal distributions. If we try to solve 
more complicated problems analytically, the algebra starts to overwhelm the statistical 
science almost entirely, making the full Bayesian analysis of realistic probability models too 
cumbersome for most practical applications. Fortunately, a battery of powerful methods has 
been developed over the past few decades for approximating and simulating from probability 
distributions. In the next four chapters, we survey some useful simulation methods that we 
apply in later chapters in the context of specific models. Some of the simpler simulation 
methods we present here have already been introduced in examples in earlier chapters. 

Because the focus of this book is on data analysis rather than computation, we move 
through the material of Part III briskly, with the intent that it be used as a reference when 
applying the models discussed in Parts IV and V. We have also attempted to place a variety 
of useful techniques in the context of a systematic general approach to Bayesian compu- 
tation. Our general philosophy in computation, as in modeling, is pluralistic, developing 
approximations using a variety of techniques. 
Chapter 10 


Introduction to Bayesian computation 


Bayesian computation revolves around two steps: computation of the posterior distribution, 
p(θ|y), and computation of the posterior predictive distribution, p(ỹ|y). So far we 
have considered examples where these could be computed analytically in closed form, with 
simulations performed directly using a combination of preprogrammed routines for standard 
distributions (normal, gamma, beta, Poisson, and so forth) and numerical computation on 
grids. For complicated or unusual models or in high dimensions, however, more elaborate 
algorithms are required to approximate the posterior distribution. Often the most efficient 
computation can be achieved by combining different algorithms. We discuss these algo- 
rithms in Chapters 11-13. This chapter provides a brief summary of statistical procedures 
to approximately evaluate integrals. The bibliographic note at the end of this chapter 
suggests other sources. 


Normalized and unnormalized densities 


We refer to the (multivariate) distribution to be simulated as the target distribution and call 
it p(θ|y). Unless otherwise noted (in Section 13.10), we assume that p(θ|y) can be easily 
computed for any value θ, up to a factor involving only the data y; that is, we assume there 
is some easily computable function q(θ|y), an unnormalized density, for which q(θ|y)/p(θ|y) 
is a constant that depends only on y. For example, in the usual use of Bayes’ theorem, we 
work with the product p(θ)p(y|θ), which is proportional to the posterior density. 


Log densities 


To avoid computational overflows and underflows, one should compute with the logarithms 
of posterior densities whenever possible. Exponentiation should be performed only when 
necessary and as late as possible; for example, in the Metropolis algorithm, the required 
ratio of two densities (11.1) should be computed as the exponential of the difference of the 
log-densities. 
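
As a small illustration of why the logarithmic scale matters, the following sketch compares a naive product of densities with a ratio computed from log-densities. The toy model (normal data with known standard deviation 1 and a flat prior), the made-up data, and the function names are assumptions for illustration only.

```python
import math

def log_normal_density(y, mean, sd):
    """Log of the normal density of y given mean and sd."""
    z = (y - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)

def log_q(theta, data):
    """Unnormalized log posterior for the toy model: flat prior plus log-likelihood."""
    return sum(log_normal_density(y, theta, 1.0) for y in data)

def density_ratio(theta_new, theta_old, data):
    """q(theta_new|y) / q(theta_old|y), exponentiating only at the very end."""
    return math.exp(log_q(theta_new, data) - log_q(theta_old, data))

# Hypothetical data: 2000 observations on a small repeating pattern.
data = [0.1 * (i % 7) for i in range(2000)]

# Multiplying 2000 densities directly underflows to exactly 0.0 ...
naive_old = math.prod(math.exp(log_normal_density(y, 0.30, 1.0)) for y in data)

# ... while the ratio computed on the log scale is perfectly well behaved.
ratio = density_ratio(0.31, 0.30, data)
```

Here the direct product is zero in floating point, so a ratio of two such products would be undefined, whereas the difference of log-densities is a modest number that can be exponentiated safely.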


10.1 Numerical integration 


Numerical integration, also called ‘quadrature,’ refers to methods in which the integral of a 
continuous function is evaluated by computing the value of the function at a finite number of 
points. By increasing the number of points at which the function is evaluated, the desired 
accuracy can be obtained. Numerical integration methods can be divided into simulation 
(stochastic) methods, such as Monte Carlo, and deterministic methods, such as many 
quadrature rule methods. 

The posterior expectation of any function h(θ) is defined as E(h(θ)|y) = ∫ h(θ)p(θ|y)dθ, 
where the integral has as many dimensions as θ. Conversely, we can express any integral over 
the space of θ as a posterior expectation by defining h(θ) appropriately. If we have posterior 
draws θ^s from p(θ|y), we can estimate the integral by the sample average, (1/S) Σ_{s=1}^S h(θ^s). 


For any finite number of simulation draws, the accuracy of this estimate can be roughly 
gauged by the standard deviation of the h(θ^s) values (we discuss this in more detail in 
Section 10.5). If it is not easy to draw from the posterior distribution, or if the h(θ^s) values 
are too variable (so that the sample average is too variable an estimate to be useful), more 
sampling methods are necessary. 


Simulation methods 


Simulation (stochastic) methods are based on obtaining random samples θ^s from the desired 
distribution p(θ|y) and estimating the expectation of any function h(θ), 

$$E(h(\theta)|y) = \int h(\theta)\,p(\theta|y)\,d\theta \approx \frac{1}{S}\sum_{s=1}^{S} h(\theta^s). \qquad (10.1)$$


The estimate is stochastic depending on generated random numbers, but the accuracy of 
the simulation can be improved by obtaining more samples. Basic Monte Carlo methods 
which produce independent samples are discussed in Sections 10.3-10.4 and Markov chain 
Monte Carlo methods which can better adapt to high-dimensional complex distributions, 
but produce dependent samples, are discussed in Chapters 11-12. Markov chain Monte 
Carlo methods have been important in making Bayesian inference practical for generic 
hierarchical models. Simulation methods can be used for high-dimensional distributions, 
and there are general algorithms which work for a wide variety of models; where necessary, 
more efficient computation can be obtained by combining these general ideas with tailored 
simulation methods, deterministic methods, and distributional approximations. 
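
The sample-average estimate in (10.1), together with a rough gauge of its accuracy, can be sketched as follows. This is a minimal illustration that assumes independent draws are available; the normal "posterior" and the function name are hypothetical stand-ins, chosen so that the exact answer is known.

```python
import math
import random

def monte_carlo_expectation(h, draws):
    """Sample average of h over the draws, with its Monte Carlo standard error.

    The standard error formula assumes the draws are independent.
    """
    values = [h(theta) for theta in draws]
    S = len(values)
    mean = sum(values) / S
    variance = sum((v - mean) ** 2 for v in values) / (S - 1)
    return mean, math.sqrt(variance / S)

rng = random.Random(1)

# Assumed posterior for illustration: theta | y ~ N(1, 0.5^2), drawn directly.
draws = [rng.gauss(1.0, 0.5) for _ in range(4000)]

# Estimate E(theta^2 | y); the exact value is 1^2 + 0.5^2 = 1.25.
estimate, mc_se = monte_carlo_expectation(lambda t: t * t, draws)
```

Quadrupling the number of draws would roughly halve the reported standard error, which is the sense in which accuracy improves with more samples.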


Deterministic methods 


Deterministic numerical integration methods are based on evaluating the integrand h(θ)p(θ|y) 
at selected points θ^s, based on a weighted version of (10.1): 

$$E(h(\theta)|y) = \int h(\theta)\,p(\theta|y)\,d\theta \approx \frac{1}{S}\sum_{s=1}^{S} w_s\,h(\theta^s)\,p(\theta^s|y),$$

with weight w_s corresponding to the volume of space represented by the point θ^s. More 
elaborate rules, such as Simpson’s, use local polynomials for improved accuracy. Deterministic 
numerical integration rules typically have lower variance than simulation methods, but 
selection of locations gets difficult in high dimensions. 

The simplest deterministic method is to evaluate the integrand on a grid with equal 
weights. Grid methods can be made adaptive by starting the grid formation from the posterior 
mode. For an integrand in which one part has a specific form, such as Gaussian, there are specific 
quadrature rules that can give more accurate estimates with fewer integrand evaluations. 
Quadrature rules exist for both bounded and unbounded regions. 
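
A minimal sketch of the equal-weight grid method for a one-dimensional problem is given below. It works with an unnormalized log density and normalizes by the sum over the grid. The gamma-shaped target, the grid limits, and the function name are assumptions chosen so the result can be checked against a known mean.

```python
import math

def grid_expectation(h, log_q, lower, upper, n_points):
    """Approximate E(h(theta)|y) on an evenly spaced grid with equal weights."""
    step = (upper - lower) / (n_points - 1)
    grid = [lower + i * step for i in range(n_points)]
    log_values = [log_q(t) for t in grid]
    shift = max(log_values)                       # subtract the max before exponentiating
    q_values = [math.exp(v - shift) for v in log_values]
    # Equal weights cancel between numerator and denominator.
    return sum(h(t) * q for t, q in zip(grid, q_values)) / sum(q_values)

# Assumed target: q(theta) = theta^2 exp(-2 theta), an unnormalized Gamma(3, 2) density.
def log_q(theta):
    return 2.0 * math.log(theta) - 2.0 * theta

# The exact mean of a Gamma(3, 2) distribution is 3/2.
posterior_mean = grid_expectation(lambda t: t, log_q, 1e-6, 15.0, 3001)
```

The same idea extends to two or three dimensions, but the number of grid points grows exponentially with dimension, which is the difficulty noted above.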


10.2 Distributional approximations 


Distributional (analytic) approximations approximate the posterior with some simpler para- 
metric distribution, from which integrals can be computed directly or by using the approx- 
imation as a starting point for simulation-based methods. We have already discussed the 
normal approximation in Chapter 4, and we consider more advanced approximation meth- 
ods in Chapter 13. 
Crude estimation by ignoring some information 


Before developing elaborate approximations or complicated methods for sampling from the 
target distribution, it is almost always useful to obtain a rough estimate of the location of 
the target distribution—that is, a point estimate of the parameters in the model—using 
some simple, noniterative technique. The method for creating this first estimate will vary 
from problem to problem but typically will involve discarding parts of the model and data 
to create a simple problem for which convenient parameter estimates can be found. 

In a hierarchical model, one can sometimes roughly estimate the main parameters γ 
by first estimating the hyperparameters φ crudely, then using the conditional posterior 
distribution of γ|φ, y. We applied this approach to the rat tumor example in Section 5.1, 
where crude estimates of the hyperparameters (α, β) were used to obtain initial estimates 
of the other parameters, θ_j. 

For another example, in the educational testing analysis in Section 5.5, the school effects 
θ_j can be crudely estimated by the data y_j from the individual experiments, and 
the between-school standard deviation τ can then be estimated crudely by the standard 
deviation of the eight y_j-values or, to be slightly more sophisticated, the estimate (5.22), 
restricted to be nonnegative. 

When some data are missing, a good way to get started is by simplistically imputing the 
missing values based on available data. (Ultimately, inferences for the missing data should 
be included as part of the model; see Chapter 18.) 

In addition to creating a starting point for a more exact analysis, crude inferences are 
useful for comparison with later results—if the rough estimate differs greatly from the results 
of the full analysis, the latter may well have errors in programming or modeling. Crude 
estimates are often convenient and reliable because they can be computed using available 
computer programs. 


10.3 Direct simulation and rejection sampling 


In simple nonhierarchical Bayesian models, it is often easy to draw from the posterior 
distribution directly, especially if conjugate prior distributions have been assumed. For more 
complicated problems, it can help to factor the distribution analytically and simulate it in 
parts, first sampling from the marginal posterior distribution of the hyperparameters, then 
drawing the other parameters conditional on the data and the simulated hyperparameters. 
It is sometimes possible to perform direct simulations and analytic integrations for parts of 
the larger problem, as was done in the examples of Chapter 5. 

Frequently, draws from standard distributions or low-dimensional non-standard distri- 
butions are required, either as direct draws from the posterior distribution of the estimand 
in an easy problem, or as an intermediate step in a more complex problem. Appendix A 
is a relatively detailed source of advice, algorithms, and procedures specifically relating to 
a variety of commonly used distributions. In this section, we describe methods of drawing 
a random sample of size 1, with the understanding that the methods can be repeated to 
draw larger samples. When obtaining more than one sample, it is often possible to reduce 
computation time by saving intermediate results such as the Cholesky factor for a fixed 
multivariate normal distribution. 


Direct approximation by calculating at a grid of points 


For the simplest discrete approximation, compute the target density, p(θ|y), at a set of 
evenly spaced values θ_1,...,θ_N that cover a broad range of the parameter space for θ, then 
approximate the continuous p(θ|y) by the discrete density at θ_1,...,θ_N, with probabilities 
p(θ_i|y)/Σ_{j=1}^N p(θ_j|y). Because the approximate density must be normalized anyway, this 
method will work just as well using an unnormalized density function, q(θ|y), in place of 
p(θ|y). 

Figure 10.1 Illustration of rejection sampling. The top curve is an approximation function, Mg(θ), 
and the bottom curve is the target density, p(θ|y). As required, Mg(θ) ≥ p(θ|y) for all θ. The 
vertical line indicates a single random draw θ from the density proportional to g. The probability 
that a sampled draw θ is accepted is the ratio of the height of the lower curve to the height of the 
higher curve at the value θ. 

Once the grid of density values is computed, a random draw from p(θ|y) is obtained by 
drawing a random sample U from the uniform distribution on [0, 1], then transforming by the 
inverse cdf method (see Section 1.9) to obtain a sample from the discrete approximation. 
When the points θ_i are spaced closely enough and miss nothing important beyond their 
boundaries, this method works well. The discrete approximation is more difficult to use 
in higher-dimensional multivariate problems, where computing at every point in a dense 
multidimensional grid becomes prohibitively expensive. 
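
The grid approximation and the inverse cdf draw can be sketched in a few lines. This is only an illustration under assumed inputs: the beta-shaped unnormalized target, the grid limits, and the helper name `make_grid_sampler` are hypothetical.

```python
import bisect
import itertools
import math
import random

def make_grid_sampler(log_q, lower, upper, n_points):
    """Return a function that draws from the discrete grid approximation to q."""
    step = (upper - lower) / (n_points - 1)
    grid = [lower + i * step for i in range(n_points)]
    log_values = [log_q(t) for t in grid]
    shift = max(log_values)
    # Running sum of unnormalized probabilities: an unnormalized discrete cdf.
    cdf = list(itertools.accumulate(math.exp(v - shift) for v in log_values))

    def draw(rng):
        u = rng.random() * cdf[-1]            # uniform draw scaled to the cdf's total
        i = bisect.bisect_right(cdf, u)       # first grid point whose cdf exceeds u
        return grid[min(i, n_points - 1)]     # guard against rounding at the top end

    return draw

# Assumed target: q(theta) = theta^2 (1 - theta)^5, an unnormalized Beta(3, 6) density.
def log_q(theta):
    return 2.0 * math.log(theta) + 5.0 * math.log(1.0 - theta)

rng = random.Random(2)
draw = make_grid_sampler(log_q, 0.0005, 0.9995, 1000)
grid_draws = [draw(rng) for _ in range(5000)]
grid_mean = sum(grid_draws) / len(grid_draws)   # Beta(3, 6) has mean 1/3
```

Because the cumulative sums are left unnormalized and the uniform draw is scaled instead, the sampler never needs the normalizing constant.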


Simulating from predictive distributions 


Once we have a sample from the posterior distribution, p(θ|y), it is typically easy to draw 
from the predictive distribution of unobserved or future data, ỹ. For each draw of θ from 
the posterior distribution, just draw one ỹ from the predictive distribution, p(ỹ|θ). The set 
of simulated ỹ’s from all the θ’s characterizes the posterior predictive distribution. Posterior 
predictive distributions are crucial to the model-checking approach described in Chapter 6. 
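
The one-draw-of-ỹ-per-draw-of-θ recipe can be sketched for a simple conjugate case. The binomial model, the assumed data (7 successes in 10 trials with a uniform prior, giving a Beta(8, 4) posterior), and the choice of 5 future trials are all hypothetical.

```python
import random

def draw_binomial(rng, n, p):
    """Draw from Binomial(n, p) as a sum of n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))

rng = random.Random(3)
S = 4000

# Step 1: posterior draws. Assumed posterior: theta | y ~ Beta(8, 4).
theta_draws = [rng.betavariate(8, 4) for _ in range(S)]

# Step 2: for each theta, draw one future count y_tilde ~ Binomial(5, theta).
y_tilde = [draw_binomial(rng, 5, theta) for theta in theta_draws]

# The y_tilde values are draws from the posterior predictive distribution;
# its exact mean here is 5 * 8 / 12.
predictive_mean = sum(y_tilde) / S
```

The spread of the `y_tilde` values reflects both posterior uncertainty in θ and sampling variability in the future data.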


Rejection sampling 


Suppose we want to obtain a single random draw from a density p(θ|y), or perhaps an 
unnormalized density q(θ|y) (with p(θ|y) = q(θ|y)/∫ q(θ|y)dθ). In the following description 
we use p to represent the target distribution, but we could just as well work with the 
unnormalized form q instead. To perform rejection sampling we require a positive function 
g(θ) defined for all θ for which p(θ|y) > 0 that has the following properties: 

• We can draw from the probability density proportional to g. It is not required that g(θ) 
integrate to 1, but g(θ) must have a finite integral. 

• The importance ratio p(θ|y)/g(θ) must have a known bound; that is, there must be some known 
constant M for which p(θ|y)/g(θ) ≤ M for all θ. 

The rejection sampling algorithm proceeds in two steps: 

1. Sample θ at random from the probability density proportional to g(θ). 

2. With probability p(θ|y)/(M g(θ)), accept θ as a draw from p. If the drawn θ is rejected, return to 
step 1. 

Figure 10.1 illustrates rejection sampling. An accepted θ has the correct distribution, p(θ|y); 
that is, the distribution of drawn θ, conditional on it being accepted, is p(θ|y) (see Exercise 
10.4). The boundedness condition is necessary so that the probability in step 2 is not greater 
than 1. 

A good approximate density g(θ) for rejection sampling should be roughly proportional 
to p(θ|y) (considered as a function of θ). The ideal situation is g ∝ p, in which case, with 
a suitable value of M, we can accept every draw with probability 1. When g is not nearly 
proportional to p, the bound M must be set so large that almost all draws obtained in step 
1 will be rejected in step 2. A virtue of rejection sampling is that it is self-monitoring: if 
the method is not working efficiently, few simulated draws will be accepted. 
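
The two-step algorithm, and the way its acceptance rate signals efficiency, can be sketched as follows. The choices here are assumptions for illustration: the target is the unnormalized standard normal kernel q(θ) = exp(−θ²/2), the approximation g is the Cauchy density, and the bound M = 2π exp(−1/2) is the maximum of q/g, attained at θ = ±1.

```python
import math
import random

def q(theta):
    """Unnormalized target: standard normal kernel."""
    return math.exp(-0.5 * theta * theta)

def g(theta):
    """Approximating density: standard Cauchy."""
    return 1.0 / (math.pi * (1.0 + theta * theta))

M = 2.0 * math.pi * math.exp(-0.5)   # bound on q(theta)/g(theta)

def rejection_draw(rng):
    """Return one accepted draw and the number of proposals it took."""
    n_tries = 0
    while True:
        n_tries += 1
        theta = math.tan(math.pi * (rng.random() - 0.5))   # step 1: draw from g
        if rng.random() < q(theta) / (M * g(theta)):       # step 2: accept or reject
            return theta, n_tries

rng = random.Random(4)
results = [rejection_draw(rng) for _ in range(4000)]
accepted = [theta for theta, _ in results]
acceptance_rate = len(results) / sum(n for _, n in results)

sample_mean = sum(accepted) / len(accepted)
sample_var = sum((t - sample_mean) ** 2 for t in accepted) / (len(accepted) - 1)
```

With these choices the theoretical acceptance probability is ∫q(θ)dθ / M = exp(1/2)/√(2π), about 0.66. A much lower observed rate in another problem would indicate a poorly matched g or an overly conservative M.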

The function g(θ) is chosen to approximate p(θ|y) and so in general will depend on y. We 
do not use the notation g(θ, y) or g(θ|y), however, because in practice we will be considering 
approximations to one posterior distribution at a time, and the functional dependence of g 
on y is not of interest. 

Rejection sampling is used in some fast methods for sampling from standard univariate 
distributions. It is also often used for generic truncated multivariate distributions, if the 
proportion of the density mass in the truncated part is not close to 1. 


10.4 Importance sampling 


Importance sampling is a method, related to rejection sampling and a precursor to the 
Metropolis algorithm (discussed in the next chapter), that is used for computing expec- 
tations using a random sample drawn from an approximation to the target distribution. 
Suppose we are interested in E(h(θ)|y), but we cannot generate random draws of θ from 
p(θ|y) and thus cannot evaluate the integral by a simple average of simulated values. 

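If g(θ) is a probability density from which we can generate random draws, then we can 
write, 

$$E(h(\theta)|y) = \frac{\int h(\theta)\,q(\theta|y)\,d\theta}{\int q(\theta|y)\,d\theta} = \frac{\int [h(\theta)\,q(\theta|y)/g(\theta)]\,g(\theta)\,d\theta}{\int [q(\theta|y)/g(\theta)]\,g(\theta)\,d\theta}, \qquad (10.2)$$

which can be estimated using S draws θ^1,...,θ^S from g(θ) by the expression, 

$$\frac{\frac{1}{S}\sum_{s=1}^{S} h(\theta^s)\,w(\theta^s)}{\frac{1}{S}\sum_{s=1}^{S} w(\theta^s)}, \qquad (10.3)$$

where the factors 

$$w(\theta^s) = \frac{q(\theta^s|y)}{g(\theta^s)}$$

are called importance ratios or importance weights. Recall that q is our general notation for 
unnormalized densities; that is, q(θ|y) equals p(θ|y) times some factor that does not depend 
on θ. 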

It is generally advisable to use the same set of random draws for both the numerator 
and denominator of (10.3) in order to reduce the sampling error in the estimate. 

If g(θ) can be chosen such that hq/g is roughly constant, then fairly precise estimates of 
the integral can be obtained. Importance sampling is not a useful method if the importance 
ratios vary substantially. The worst possible scenario occurs when the importance ratios 
are small with high probability but with a low probability are huge, which happens, for 
example, if q has wide tails compared to g, as a function of θ. 
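
The ratio estimate (10.3) can be sketched for a case where the approximation has wider tails than the target, so the weights are bounded. The setup is an assumption for illustration: the target is an unnormalized standard normal, the approximation g is a t distribution with 3 degrees of freedom (used only up to a constant, which cancels in the ratio), and the helper names are hypothetical.

```python
import math
import random

def draw_t3(rng):
    """Draw from a t distribution with 3 degrees of freedom."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(1.5, 2.0)        # chi-squared with 3 degrees of freedom
    return z / math.sqrt(chi2 / 3.0)

def log_q(theta):
    """Log of the unnormalized target, a standard normal kernel."""
    return -0.5 * theta * theta

def log_g(theta):
    """Log of the t_3 density, up to an additive constant."""
    return -2.0 * math.log1p(theta * theta / 3.0)

def importance_estimate(h, draws):
    """Self-normalized importance sampling estimate of E(h(theta)|y)."""
    log_w = [log_q(t) - log_g(t) for t in draws]
    shift = max(log_w)                        # stabilize before exponentiating
    w = [math.exp(lw - shift) for lw in log_w]
    # The same draws are used in the numerator and the denominator.
    return sum(h(t) * wi for t, wi in zip(draws, w)) / sum(w)

rng = random.Random(5)
draws = [draw_t3(rng) for _ in range(5000)]

# Estimate E(theta^2 | y); the exact value under the normal target is 1.
second_moment = importance_estimate(lambda t: t * t, draws)
```

Reversing the roles (normal draws for a t_3 target) would leave the code unchanged but produce unbounded weights, which is the bad case described above.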


Accuracy and efficiency of importance sampling estimates 


In general, without some form of mathematical analysis of the exact and approximate 
densities, there is always the realistic possibility that we have missed some extremely large 
but rare importance weights. However, it may help to examine the distribution of sampled 
importance weights to discover possible problems. It can help to examine a histogram of 
the logarithms of the largest importance ratios: estimates will often be poor if the largest 
ratios are too large relative to the average. In contrast, we do not have to worry about the 
behavior of small importance ratios, because they have little influence on equation (10.2). 
If the variance of the weights is finite, the effective sample size can be estimated using an 
approximation, 

$$S_{\mathrm{eff}} = \frac{1}{\sum_{s=1}^{S} (\tilde{w}(\theta^s))^2}, \qquad (10.4)$$

where w̃(θ^s) are normalized weights; that is, w̃(θ^s) = w(θ^s)/Σ_{s'=1}^S w(θ^{s'}). The effective 
sample size S_eff is small if there are few extremely high weights which would unduly influence 
the distribution. If the distribution has occasional very large weights, however, this estimate 
is itself noisy; it can thus be taken as no more than a rough guide. 
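
A direct transcription of (10.4), taking log weights as input for numerical stability, might look like the sketch below. The two sets of weights are artificial extremes chosen to show the range of the estimate, not output from a real analysis.

```python
import math

def effective_sample_size(log_weights):
    """Estimate S_eff = 1 / sum of squared normalized weights, as in (10.4)."""
    shift = max(log_weights)                         # avoid overflow when exponentiating
    w = [math.exp(lw - shift) for lw in log_weights]
    total = sum(w)
    return 1.0 / sum((wi / total) ** 2 for wi in w)

# Artificial extreme 1: all weights equal, so every draw counts fully.
ess_equal = effective_sample_size([0.0] * 1000)

# Artificial extreme 2: one weight exceeds the others by a factor of exp(20),
# so the estimate is effectively based on a single draw.
ess_dominated = effective_sample_size([0.0] * 999 + [20.0])
```

With equal weights the result is S = 1000; with one dominant weight it is barely above 1.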


Importance resampling 


To obtain independent samples with equal weights, it is possible to use importance resam- 
pling (also called sampling-importance resampling or SIR). 

Once S draws, θ^1,...,θ^S, from the approximate distribution g have been sampled, a 
sample of k < S draws can be simulated as follows. 

1. Sample a value θ from the set {θ^1,...,θ^S}, where the probability of sampling each θ^s is 
proportional to the weight, w(θ^s) = q(θ^s|y)/g(θ^s). 

2. Sample a second value using the same procedure, but excluding the already sampled 
value from the set. 

3. Repeatedly sample without replacement k − 2 more times. 


Why sample without replacement?¹ If the importance weights are moderate, sampling 
with and without replacement gives similar results. Now consider a bad case, with a few 
large weights and many small weights. Sampling with replacement will pick the same few 
values of θ repeatedly; in contrast, sampling without replacement yields a more desirable 
intermediate approximation somewhere between the starting and target densities. For other 
purposes, sampling with replacement could be superior. 
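
The three resampling steps can be sketched as a loop that repeatedly samples one index in proportion to its weight and removes it from the pool. The t_3 approximation to a standard normal target and all function names are assumptions for illustration. The simple implementation below costs on the order of kS operations, which is adequate for small problems.

```python
import math
import random

def draw_t3(rng):
    """Draw from a t distribution with 3 degrees of freedom."""
    return rng.gauss(0.0, 1.0) / math.sqrt(rng.gammavariate(1.5, 2.0) / 3.0)

def log_weight(theta):
    """log q - log g for a standard normal target and a t_3 approximation."""
    return -0.5 * theta * theta + 2.0 * math.log1p(theta * theta / 3.0)

def importance_resample(draws, log_weights, k, rng):
    """Sample k of the draws without replacement, proportional to their weights."""
    shift = max(log_weights)
    pool = [(t, math.exp(lw - shift)) for t, lw in zip(draws, log_weights)]
    chosen = []
    for _ in range(k):
        weights = [w for _, w in pool]
        index = rng.choices(range(len(pool)), weights=weights)[0]
        chosen.append(pool.pop(index)[0])     # remove the sampled value from the set
    return chosen

rng = random.Random(6)
S, k = 2000, 200
draws = [draw_t3(rng) for _ in range(S)]
resampled = importance_resample(draws, [log_weight(t) for t in draws], k, rng)

mean_r = sum(resampled) / k
var_r = sum((t - mean_r) ** 2 for t in resampled) / (k - 1)   # target variance is 1
```

Sampling with replacement would only require a single call to `rng.choices(draws, weights=..., k=k)`.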


Uses of importance sampling in Bayesian computation 


Importance sampling can be used to improve analytic posterior approximations as described 
in Chapter 13. If importance sampling does not yield an accurate approximation, then 
importance resampling can still be helpful for obtaining starting points for an iterative 
simulation of the posterior distribution, as described in Chapter 11. 

Importance (re)sampling can also be useful when considering mild changes in the pos- 
terior distribution, for example replacing the normal distribution by a t in the 8 schools 
model or when computing leave-one-out cross-validation. The idea in this case is to treat the 
original posterior distribution as an approximation to the modified posterior distribution. 

A good way to develop an understanding of importance sampling is to program simulations 
for simple examples, such as using a t_3 distribution as an approximation to the normal 
(good practice) or vice versa (bad practice); see Exercises 10.6 and 10.7. The approximating 
distribution g in importance sampling should cover all the important regions of the target 
distribution. 


¹ P.S. Instead of this, we now recommend Pareto smoothed importance resampling with replacement. 
10.5 How many simulation draws are needed? 


Bayesian inferences are usually most conveniently summarized by random draws from the 
posterior distribution of the model parameters. Percentiles of the posterior distribution of 
univariate estimands can be reported to convey the shape of the distribution. For example, 
reporting the 2.5%, 25%, 50%, 75%, and 97.5% points of the sampled distribution of an 
estimand provides a 50% and a 95% posterior interval and also conveys skewness in its 
marginal posterior density. Scatterplots of simulations, contour plots of density functions, 
or more sophisticated graphical techniques can also be used to examine the posterior dis- 
tribution in two or three dimensions. Quantities of interest can be defined in terms of the 
parameters (for example, LD50 in the bioassay example in Section 3.7) or of parameters 
and data. 

We also use posterior simulations to make inferences about predictive quantities. Given 
each draw θ^s, we can sample any predictive quantity, ỹ^s ~ p(ỹ|θ^s) or, for a regression model, 
ỹ^s ~ p(ỹ|X̃, θ^s). Posterior inferences and probability calculations can then be performed 
for each predictive quantity using the S simulations (for example, the predicted probability 
of Bill Clinton winning each state in 1992, as displayed in Figure 6.1 on page 143). 

Finally, given each simulation θ^s, we can simulate a replicated dataset y^{rep s}. As described 
in Chapter 6, we can then check the model by comparing the data to these posterior 
predictive replications. 

Our goal in Bayesian computation is to obtain a set of independent draws θ^s, s = 
1,..., S, from the posterior distribution, with enough draws S so that quantities of interest 
can be estimated with reasonable accuracy. For most examples, S = 100 independent draws 
are enough for reasonable posterior summaries. We can see this by considering a scalar 
parameter θ with an approximately normal posterior distribution (see Chapter 4) with 
mean μ_θ and standard deviation σ_θ. We assume these cannot be calculated analytically 
and instead are estimated from the mean θ̄ and standard deviation s_θ of the S simulation 
draws. The posterior mean is then estimated to an accuracy of approximately s_θ/√S. The 
total standard deviation of the computational parameter estimate (including Monte Carlo 
error, the uncertainty contributed by having only a finite number of simulation draws) 
is then s_θ√(1 + 1/S). For S = 100, the factor √(1 + 1/S) is 1.005, implying that Monte 
Carlo error adds almost nothing to the uncertainty coming from actual posterior variance. 
However, it can be convenient to have more than 100 simulations just so that the numerical 
summaries are more stable, even if this stability typically confers no important practical 
advantage. 

For some posterior inferences, more simulation draws are needed to obtain desired precisions. 
For example, posterior probabilities are estimated to a standard deviation of 
√(p(1 − p)/S), so that S = 100 simulations allow estimation of a probability near 0.5 to 
an accuracy of 5%. S = 2500 simulations are needed to estimate to an accuracy of 1%. 
Even more simulation draws are needed to compute the posterior probability of rare events, 
unless analytic methods are used to assist the computations. 
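
The arithmetic behind these rules of thumb is easy to check directly. The short sketch below evaluates the two formulas used in this section; the function names are ours.

```python
import math

def total_sd_factor(S):
    """Inflation of the posterior sd due to Monte Carlo error: sqrt(1 + 1/S)."""
    return math.sqrt(1.0 + 1.0 / S)

def probability_se(p, S):
    """Standard deviation of a simulation estimate of a probability p from S draws."""
    return math.sqrt(p * (1.0 - p) / S)

factor_100 = total_sd_factor(100)        # about 1.005
se_100 = probability_se(0.5, 100)        # 0.05, that is, 5%
se_2500 = probability_se(0.5, 2500)      # 0.01, that is, 1%
```

Because the standard deviation shrinks with the square root of S, a fivefold gain in accuracy (from 5% to 1%) costs a 25-fold increase in the number of draws.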


Example. Educational testing experiments 

We illustrate with the hierarchical model fitted to the data from the 8 schools as 
 described in Section 5.5. First consider inference for a particular parameter, for example 
 θ_1, the estimated effect of coaching in school A. Table 5.3 shows that from 200 
 simulation draws, our posterior median estimate was 10, with a 50% interval of [7, 16] 
 and a 95% interval of [−2, 31]. Repeating the computation, another 200 draws gave 
 a posterior median of 9, with a 50% interval of [6, 14] and a 95% interval of [−4, 32]. 
 These intervals differ slightly but convey the same general information about θ_1. From 
 S = 10,000 simulation draws, the median is 10, the 50% interval is [6, 15], and the 95% 
 interval is [−2, 31]. In practice, these are no different from either of the summaries 
 obtained from 200 draws. 
 We now consider some posterior probability statements. Our original 200 simulations 
 gave us an estimate of 0.73 for the posterior probability Pr(θ_1 > θ_3|y), the probability 
 that the effect is larger in school A than in school C (see the end of Section 5.5). This 
 probability is estimated to an accuracy of √(0.73(1 − 0.73)/200) = 0.03, which is good 
 enough in this example. 

 How about a rarer event, such as the probability that the effect in School A is greater 
 than 50 points? None of our 200 simulations θ_1^s exceeds 50, so the simple estimate 
 of the probability is that it is zero (or less than 1/200). When we simulate S = 
 10,000 draws, we find 3 of the draws to have θ_1 > 50, which yields a crude estimated 
 probability of 0.0003. 

An alternative way to compute this probability is semi-analytically. Given $\mu$ and $\tau$,
the effect in school A has a normal posterior distribution, $p(\theta_1 \mid \mu, \tau, y) = \mathrm{N}(\hat{\theta}_1, V_1)$,
where this mean and variance depend on $y_1$, $\mu$, and $\tau$ (see (5.17) on page 116).
The conditional probability that $\theta_1$ exceeds 50 is then
$\Pr(\theta_1 > 50 \mid \mu, \tau, y) = \Phi((\hat{\theta}_1 - 50)/\sqrt{V_1})$,
and we can estimate the unconditional posterior probability $\Pr(\theta_1 > 50 \mid y)$
as the average of these normal probabilities as computed for each simulation draw
$(\mu^s, \tau^s)$. Using this approach, $S = 200$ draws are sufficient for a reasonably accurate
estimate.
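
The semi-analytic calculation can be sketched as follows. This is an illustration under stated assumptions, not the computation used for the results in the text: the conditional mean and variance use the standard precision-weighted form for the hierarchical normal model, the school A values (estimate 28, standard error 15) are those of the eight-schools example, and the draws of $(\mu, \tau)$ are hypothetical stand-ins generated only so that the code runs. In a real analysis they would be the posterior simulation draws.

```python
import math
import random


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def conditional_tail_prob(mu, tau, y1, sigma1, cutoff):
    """Pr(theta_1 > cutoff | mu, tau, y) when theta_1 | mu, tau, y is normal
    with precision-weighted mean theta_hat and variance V1."""
    precision = 1.0 / sigma1**2 + 1.0 / tau**2
    theta_hat = (y1 / sigma1**2 + mu / tau**2) / precision
    v1 = 1.0 / precision
    return normal_cdf((theta_hat - cutoff) / math.sqrt(v1))


# School A: estimated effect and its standard error.
y1, sigma1 = 28.0, 15.0

# HYPOTHETICAL stand-in draws of (mu, tau); replace with real posterior draws.
rng = random.Random(1)
draws = [(rng.gauss(8.0, 5.0), abs(rng.gauss(0.0, 6.0)) + 0.1)
         for _ in range(200)]

# Average the conditional normal probabilities over the draws.
probs = [conditional_tail_prob(mu, tau, y1, sigma1, 50.0) for mu, tau in draws]
estimate = sum(probs) / len(probs)
print(estimate)
```

Each draw contributes a smooth probability rather than a 0/1 indicator, which is why far fewer draws are needed than when counting how often $\theta_1$ exceeds 50.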


In general, fewer simulations are needed to estimate posterior medians of parameters, 
probabilities near 0.5, and low-dimensional summaries than extreme quantiles, posterior 
means, probabilities of rare events, and higher-dimensional summaries. In most of the 
examples in this book, we use a moderate number of simulation draws (typically 100 to 
2000) as a way of emphasizing that applied inferences do not typically require a high level 
of simulation accuracy. 
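
As an illustration of the first of these comparisons, the following sketch repeatedly takes 200 independent draws from a standard normal distribution (standing in for a posterior distribution) and compares how much the estimated median and the estimated 97.5% quantile vary from one set of draws to the next. The sample sizes, the seed, and the helper name quantile_estimate are our choices for the example.

```python
import random
import statistics


def quantile_estimate(draws, prob):
    """Empirical quantile: the order statistic at position prob * len(draws)."""
    ordered = sorted(draws)
    index = min(int(prob * len(ordered)), len(ordered) - 1)
    return ordered[index]


rng = random.Random(3)
n_draws, n_reps = 200, 500
medians, upper_tails = [], []
for _ in range(n_reps):
    draws = [rng.gauss(0.0, 1.0) for _ in range(n_draws)]
    medians.append(quantile_estimate(draws, 0.5))
    upper_tails.append(quantile_estimate(draws, 0.975))

# Variability across replications is the simulation standard deviation.
sd_median = statistics.stdev(medians)
sd_tail = statistics.stdev(upper_tails)
print(round(sd_median, 3), round(sd_tail, 3))
```

The extreme quantile is noticeably less stable than the median for the same number of draws, in line with the statement above.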


10.6 Computing environments 


Programs exist for full Bayesian inference for commonly used models such as hierarchical 
linear and logistic regression and some nonparametric models. These implementations use 
various combinations of the Bayesian computation algorithms discussed in the following 
chapters. 


We see (at least) four reasons for wanting an automatic and general program for fitting 
Bayesian models. First, many applied statisticians and subject-matter researchers would 
like to fit Bayesian models but do not have the mathematical, statistical, and computer 
skills to program the inferential steps themselves. Economists and political scientists can 
run regressions with a single line of code or a click on a menu, epidemiologists can do logis- 
tic regression, sociologists can fit structural equation models, psychologists can fit analysis 
of variance, and education researchers can fit hierarchical linear models. We would like all 
these people to be able to fit Bayesian models (which include all those previously mentioned 
as special cases but also allow for generalizations such as robust error models, mixture dis- 
tributions, and various arbitrary functional forms, not to mention a framework for including 
prior information). 

A second use for a general Bayesian package is for teaching. Students can first learn 
to do inference automatically, focusing on the structure of their models rather than on 
computation, learning the algebra and computing later. A deeper understanding is useful— 
if we did not believe so, we would not have written this book. Ultimately it is helpful to 
learn what lies behind the inference and computation because one way we understand a 
model is by comparing to similar models that are slightly simpler or more complicated, and 
one way we understand the process of model fitting is by seeing it as a map from data and 
assumptions to inferences. Even when using a black box or ‘inference engine,’ we often want
to go back and see where the substantively important features of our posterior distribution
came from. 

A third motivation for writing an automatic model-fitting package is to provide a programming
environment for implementing new models, so that practitioners who could program their
own models are free to focus on more important statistical issues.

Finally, a fourth potential benefit of a general Bayesian program is that it can be faster 
than custom code. There is an economy of scale. Because the automatic program will 
be used so many times, it can make sense to optimize it in various ways, implement it 
in parallel, and include algorithms that require more coding effort but are faster in hard 
problems. 

That said, no program can be truly general. Any such software should be open and 
accessible, with places where the (sophisticated) user can alter the program or ‘hold its 
hand’ to ensure that it does what it is supposed to do. 


The Bugs family of programs 


During the 1990s and early 2000s, a group of statisticians and programmers developed Bugs 
(Bayesian inference using Gibbs sampling), a general-purpose program in which a user could 
supply data and specify a statistical model using a convenient language not much different 
from the mathematical notation used in probability theory, and then the program used a 
combination of Gibbs sampling, the Metropolis algorithm, and slice sampling (algorithms 
which are described in the following chapters) to obtain samples from the posterior distribu- 
tion. When run for a sufficiently long time, Bugs could provide inference for an essentially 
unlimited variety of models. Instead of models being chosen from a list or menu of prepro- 
grammed options, Bugs models could be put together in arbitrary ways using a large set of 
probability distributions, much in the way that we construct models in this book. 

The most important limitations of Bugs have been computational. The program ex- 
cels with complicated models for small datasets but can be slow with large datasets and 
multivariate structures. Bugs works by updating one scalar parameter at a time (following 
the ideas of Gibbs sampling, as discussed in the following chapter), which results in slow 
convergence when parameters are strongly dependent, as can occur in hierarchical models. 


Stan 


We have recently developed an open-source program, Stan (named after Stanislaw Ulam, a
mathematician who was one of the inventors of the Monte Carlo method), that has
functionality similar to that of Bugs but uses a more complicated simulation algorithm, Hamiltonian Monte
Carlo (see Section 12.4). Stan is written in C++ and has been designed to be easily
extendable, both in allowing improvements to the updating algorithm and in being open 
to the development of new models. We now develop and fit our Bayesian models in Stan, 
using its Bugs-like modeling language and improving it as necessary to fit more complicated 
models and larger problems. Stan is intended to serve both as an automatic program for 
general users and as a programming environment for the development of new simulation 
methods. We discuss Stan further in Section 12.6 and Appendix C. 


Other Bayesian software 


Following the success of Bugs, many research groups have developed general tools for fitting
Bayesian models. These include mcsim (a C program that implements Gibbs and
Metropolis for differential equation systems such as the toxicology model described in Section
19.2), PyMC (a suite of routines in the open-source language Python), and HBC (developed
for discrete-parameter models in computational linguistics). These and other programs have
been developed by individuals or communities of users who have wanted to fit particular
models and have found it most effective to do this by writing general implementations. In
addition, various commercial programs are under development that fit Bayesian models with
various degrees of generality.


10.7 Debugging Bayesian computing


Debugging using fake data


Our usual approach for building confidence in our posterior inferences is to fit different 
versions of the desired model, noticing when the inferences change unexpectedly. Section 
10.2 discusses crude inferences from simplified models that typically ignore some structure 
in the data. 

Within the computation of any particular model, we check convergence by running 
parallel simulations from different starting points, checking that they mix and converge to 
the same estimated posterior distribution (see Section 11.4). This can be seen as a form of 
debugging of the individual simulated sequences. 

When a model is particularly complicated, or its inferences are unexpected enough to 
be not necessarily believable, one can perform more elaborate debugging using fake data. 
The basic approach is: 


1. Pick a reasonable value for the ‘true’ parameter vector $\theta$. Strictly speaking, this value
should be a random draw from the prior distribution, but if the prior distribution is
noninformative, then any reasonable value of $\theta$ should work.


2. If the model is hierarchical (as it generally will be), then perform the above step by
picking reasonable values for the hyperparameters, then drawing the other parameters
from the prior distribution conditional on the specified hyperparameters.


3. Simulate a large fake dataset $y^{\mathrm{fake}}$ from the data distribution $p(y \mid \theta)$.


4. Perform posterior inference about $\theta$ from $p(\theta \mid y^{\mathrm{fake}})$.


5. Compare the posterior inferences to the ‘true’ $\theta$ from step 1 or 2. For instance, for any
element of $\theta$, there should be a 50% probability that its 50% posterior interval contains
the truth.


Formally, this procedure requires that the model has proper prior distributions and that
the frequency evaluations be averaged over many values of the ‘true’ $\theta$, drawn independently
from the prior distribution in step 1 above. In practice, however, the debugging procedure
can be useful with just a single reasonable choice of $\theta$ in the first step. If the model does
not produce reasonable inferences with $\theta$ set to a reasonable value, then there is probably
something wrong, either in the computation or in the model itself.

Inference from a single fake dataset can be revealing for debugging purposes, if the true
value of $\theta$ is far outside the computed posterior distribution. If the dimensionality of $\theta$
is large (as can easily happen with hierarchical models), we can go further and compute
debugging checks such as the proportion of the 50% intervals that contain the true value.

To check that inferences are correct on average, one can create a ‘residual plot’ as
follows. For each scalar parameter $\theta_j$, define the predicted value as the average of the
posterior simulations of $\theta_j$, and the error as the true $\theta_j$ (as specified or simulated in step 1
or 2 above) minus the predicted value. If the model is computed correctly, the errors should
have zero mean, and we can diagnose problems by plotting errors vs. predicted values, with
one dot per parameter.
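
The checks described so far can be sketched for a toy model with many parameters and a single fake dataset. This is a minimal illustration under assumptions of our own: the model is $\theta_j \sim \mathrm{N}(\mu, \tau^2)$, $y_j \sim \mathrm{N}(\theta_j, \sigma^2)$ with all of $\mu$, $\tau$, $\sigma$ fixed at hypothetical values, and the posterior draws are generated from the exact conditional posterior distribution. In real debugging, those draws would instead come from the program being checked.

```python
import math
import random

rng = random.Random(0)
J, S = 500, 400                   # number of parameters, draws per parameter
mu, tau, sigma = 0.0, 5.0, 2.0    # hypothetical hyperparameters and data sd

# Steps 1-3: 'true' parameters given the hyperparameters, then fake data.
theta_true = [rng.gauss(mu, tau) for _ in range(J)]
y_fake = [rng.gauss(t, sigma) for t in theta_true]

# Step 4: posterior draws for each theta_j (exact here, for illustration).
post_var = 1.0 / (1.0 / sigma**2 + 1.0 / tau**2)
covered, errors, predicted = 0, [], []
for t, y in zip(theta_true, y_fake):
    post_mean = post_var * (y / sigma**2 + mu / tau**2)
    draws = sorted(rng.gauss(post_mean, math.sqrt(post_var)) for _ in range(S))
    # Step 5: central 50% interval from the simulated quartiles.
    lower, upper = draws[S // 4], draws[(3 * S) // 4]
    covered += lower <= t <= upper
    pred = sum(draws) / S          # predicted value: posterior mean estimate
    predicted.append(pred)
    errors.append(t - pred)        # error: truth minus predicted value

coverage = covered / J             # should be near 0.5
mean_error = sum(errors) / J       # should be near 0
print(coverage, mean_error)
```

Plotting errors against predicted gives the residual plot described above; a coverage far from 0.5 or a mean error far from zero would point to a bug in the posterior computation.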

For models with only a few parameters, one can get the same effect by performing many
fake-data simulations, resampling a new ‘true’ vector $\theta$ and a new fake dataset $y^{\mathrm{fake}}$ each
time, and then checking that the errors have zero mean and the correct interval coverage,
on average.


Model checking and convergence checking as debugging


Finally, the techniques for model checking and comparison described in Chapters 6 and 
7, and the techniques for checking for poor convergence of iterative simulations, which we 
describe in Section 11.4, can also be interpreted as methods for debugging. 

In practice, when a model grossly misfits the data, or when a histogram or scatterplot 
or other display of replicated data looks weird, it is often because of a computing error. 
These errors can be as simple as forgetting to recode discrete responses (for example, 1 =
Yes, 0 = No, -9 = Don’t Know) or misspelling a regression predictor, or as subtle as a
miscomputed probability ratio in a Metropolis updating step (see Section 11.2), but typically 
they show up as predictions that do not make sense or do not fit the data. Similarly, poor 
convergence of an iterative simulation algorithm can sometimes occur from programming 
errors or conceptual errors in the model. 

When posterior inferences from a fitted model seem wrong, it is sometimes unclear if 
there is a bug in the program or a fundamental problem with the model itself. At this 
point, a useful conceptual and computational strategy is to simplify—to remove parameters 
from the model, or to give them fixed values or highly informative prior distributions, or 
to separately analyze data from different sources (that is, to un-link a hierarchical model). 
These computations can be performed in steps, for example first removing a parameter 
from the model, then setting it equal to a null value (for example, zero) just to check that 
adding it into the program has no effect, then fixing it at a reasonable nonzero value, then 
assigning it a precise prior distribution, then allowing it to be estimated more fully from 
the data. Model building is a gradual process, and we often find ourselves going back and 
forth between simpler and more complicated models, both for conceptual and computational 
reasons. 


10.8 Bibliographic note 


Excellent general books on simulation from a statistical perspective are Ripley (1987) and
Gentle (2003), which cover two topics that we do not address in this chapter: creating
uniformly distributed (pseudo)random numbers and simulating from standard distributions 
(on the latter, see our Appendix A for more details). Hammersley and Handscomb (1964) is 
a classic reference on simulation. Thisted (1988) is a general book on statistical computation 
that discusses many optimization and simulation techniques. Robert and Casella (2004) 
cover simulation algorithms from a variety of statistical perspectives. 

For further information on numerical integration techniques, see Press et al. (1986);
reviews of the application of these techniques to Bayesian inference are provided by Smith et
al. (1985) and O’Hagan and Forster (2004).

Bayesian quadrature methods using Gaussian process priors have been proposed by 
O’Hagan (1991) and Rasmussen and Ghahramani (2003). Adaptive grid sampling has been 
presented, for example, by Rue, Martino, and Chopin (2009). 

Importance sampling is a relatively old idea in numerical computation; for some early 
references, see Hammersley and Handscomb (1964). Geweke (1989) is a pre-Gibbs sampler 
discussion in the context of Bayesian computation; also see Wakefield, Gelfand, and Smith 
(1991). Chapters 2-4 of Liu (2001) discuss importance sampling in the context of Markov
chain simulation algorithms. Gelfand, Dey, and Chang (1992) proposed importance sampling
for fast leave-one-out cross-validation. Kong, Liu, and Wong (1996) propose a method
for estimating the reliability of importance sampling using approximation of the variance
of importance weights. Skare, Bølviken, and Holden (2003) propose improved importance
sampling using modified weights which reduce the bias of the estimate. Ionides (2008) 
recommends truncation of the highest importance weights. 

Importance resampling was introduced by Rubin (1987b), and an accessible exposition
is given by Smith and Gelfand (1992). Skare, Bølviken, and Holden (2003) discuss why it
is best to draw importance resamples ($k < S$) without replacement and they propose an
improved algorithm that uses modified weights. When k = S draws are required, Kitagawa 
(1996) presents stratified and deterministic resampling, and Liu (2001) presents residual 
resampling; these methods all have smaller variance than simple random resampling. 

Kass et al. (1998) discuss many practical issues in Bayesian simulation. Gelman and 
Hill (2007, chapter 8) and Cook, Gelman, and Rubin (2006) show how to check Bayesian 
computations using fake-data simulation. Kerman and Gelman (2006, 2007) discuss some 
ways in which R can be modified to allow more direct manipulation of random variable 
objects and Bayesian inferences. 

Information about Bugs appears in Spiegelhalter et al. (1994, 2003), and about the related
program Jags in Plummer (2003). The article by Lunn et al. (2009) and ensuing discussion give a sense of the
scope of the Bugs project. Stan is discussed further in Section 12.6 and Appendix C of 
this book, as well as at Stan Development Team (2012). Several other efforts have been 
undertaken to develop Bayesian inference tools for particular classes of model, for example 
Daume (2008). 


10.9 Exercises 


The exercises in Part III focus on computational details. Data analysis exercises using the 
methods described in this part of the book appear in the appropriate chapters in Parts IV 
and V. 


1. Number of simulation draws: Suppose the scalar variable $\theta$ is approximately normally
distributed in a posterior distribution that is summarized by $n$ independent simulation
draws. How large does $n$ have to be so that the 2.5% and 97.5% quantiles of $\theta$ are
specified to an accuracy of $0.1\,\mathrm{sd}(\theta \mid y)$?


(a) Figure this out mathematically, without using simulation. 
(b) Check your answer using simulation and show your results. 


2. Number of simulation draws: suppose you are interested in inference for the parameter $\theta_1$
in a multivariate posterior distribution, $p(\theta \mid y)$. You draw 100 independent values $\theta$ from
the posterior distribution of $\theta$ and find that the posterior density for $\theta_1$ is approximately
normal with mean of about 8 and standard deviation of about 4.


(a) Using the average of the 100 draws of $\theta_1$ to estimate the posterior mean, $\mathrm{E}(\theta_1 \mid y)$, what
is the approximate standard deviation due to simulation variability?


(b) About how many simulation draws would you need to reduce the simulation standard 
deviation of the posterior mean to 0.1 (thus justifying the presentation of results to 
one decimal place)? 


(c) A more usual summary of the posterior distribution of $\theta_1$ is a 95% central posterior
interval. Based on the data from 100 draws, what are the approximate simulation 
standard deviations of the estimated 2.5% and 97.5% quantiles of the posterior distri- 
bution? (Recall that the posterior density is approximately normal.) 


(d) About how many simulation draws would you need to reduce the simulation standard 
deviations of the 2.5% and 97.5% quantiles to 0.1? 


(e) In the eight-schools example of Section 5.5, we simulated 200 posterior draws. What 
are the approximate simulation standard deviations of the 2.5% and 97.5% quantiles 
for school A in Table 5.3? 


(f) Why was it not necessary, in practice, to simulate more than 200 draws for the SAT
coaching example?


3. Posterior computations for the binomial model: suppose $y_1 \sim \mathrm{Bin}(n_1, p_1)$ is the number
of successfully treated patients under an experimental new drug, and $y_2 \sim \mathrm{Bin}(n_2, p_2)$
is the number of successfully treated patients under the standard treatment. Assume
that $y_1$ and $y_2$ are independent and assume independent beta prior densities for the two
probabilities of success. Let $n_1 = 10$, $y_1 = 6$, and $n_2 = 20$, $y_2 = 10$. Repeat the following for
several different beta prior specifications.

(a) Use simulation to find a 95% posterior interval for $p_1 - p_2$ and the posterior probability
that $p_1 > p_2$.

(b) Numerically integrate to estimate the posterior probability that $p_1 > p_2$.

4. Rejection sampling: 

(a) Prove that rejection sampling gives draws from $p(\theta \mid y)$.

(b) Why is the boundedness condition on $p(\theta \mid y)/q(\theta)$ necessary for rejection sampling?

5. Rejection sampling and importance sampling: Consider the model, $y_j \sim \mathrm{Binomial}(n_j, \theta_j)$,
where $\theta_j = \mathrm{logit}^{-1}(\alpha + \beta x_j)$, for $j = 1, \ldots, J$, and with independent prior distributions,
$\alpha \sim t_4(0, 2^2)$ and $\beta \sim t_4(0, 1)$. Suppose $J = 10$, the $x_j$ values are randomly drawn from
a $\mathrm{U}(0, 1)$ distribution, and $n_j \sim \mathrm{Poisson}^+(5)$, where $\mathrm{Poisson}^+$ is the Poisson distribution
restricted to positive values.

(a) Sample a dataset at random from the model. 

(b) Use rejection sampling to get 1000 independent posterior draws from $(\alpha, \beta)$.

(c) Approximate the posterior density for $(\alpha, \beta)$ by a normal centered at the posterior
mode with covariance matrix fit to the curvature at the mode.

(d) Take 1000 draws from the two-dimensional $t_4$ distribution with that center and scale
matrix and use importance sampling to estimate $\mathrm{E}(\alpha \mid y)$ and $\mathrm{E}(\beta \mid y)$.

(e) Compute an estimate of effective sample size for importance sampling using (10.4) on 
page 266. 

6. Importance sampling when the importance weights are well behaved: consider a univariate
posterior distribution, $p(\theta \mid y)$, which we wish to approximate and then calculate
moments of, using importance sampling from an unnormalized density, $g(\theta)$. Suppose the
posterior distribution is normal, and the approximation is $t_3$ with mode and curvature
matched to the posterior density.

(a) Draw a sample of size $S = 100$ from the approximate density and compute the importance
ratios. Plot a histogram of the log importance ratios.

(b) Estimate $\mathrm{E}(\theta \mid y)$ and $\mathrm{var}(\theta \mid y)$ using importance sampling. Compare to the true values.

(c) Repeat (a) and (b) for S = 10,000. 

(d) Using the sample obtained in (c), compute an estimate of effective sample size using 
(10.4) on page 266. 

7. Importance sampling when the importance weights are too variable: repeat the previous
exercise, but with a $t_3$ posterior distribution and a normal approximation. Explain why
the estimates of $\mathrm{var}(\theta \mid y)$ are systematically too low.

8. Importance resampling with and without replacement: 

(a) Consider the bioassay example introduced in Section 3.7. Use importance resampling
to approximate draws from the posterior distribution of the parameters $(\alpha, \beta)$, using
the normal approximation of Section 4.1 as the starting distribution. Sample $S = 10{,}000$
from the approximate distribution, and resample without replacement $k = 1000$
samples. Compare your simulations of $(\alpha, \beta)$ to Figure 3.3b and discuss any
discrepancies. 

(b) Comment on the distribution of the simulated importance ratios. 


(c) Repeat part (a) using importance resampling with replacement. Discuss how the results
differ.


Chapter 11


Basics of Markov chain simulation 


Many clever methods have been devised for constructing and sampling from arbitrary pos- 
terior distributions. Markov chain simulation (also called Markov chain Monte Carlo, or 
MCMC) is a general method based on drawing values of $\theta$ from approximate distributions
and then correcting those draws to better approximate the target posterior distribution,
$p(\theta \mid y)$. The sampling is done sequentially, with the distribution of the sampled draws depending
on the last value drawn; hence, the draws form a Markov chain. (As defined in
probability theory, a Markov chain is a sequence of random variables $\theta^1, \theta^2, \ldots$, for which,
for any $t$, the distribution of $\theta^t$ given all previous $\theta$’s depends only on the most recent value,
$\theta^{t-1}$.) The key to the method’s success, however, is not the Markov property but rather that
the approximate distributions are improved at each step in the simulation, in the sense of 
converging to the target distribution. As we shall see in Section 11.2, the Markov property 
is helpful in proving this convergence. 

Figure 11.1 illustrates a simple example of a Markov chain simulation—in this case, a 
Metropolis algorithm (see Section 11.2) in which $\theta$ is a vector with only two components,
with a bivariate unit normal posterior distribution, $\theta \sim \mathrm{N}(0, I)$. First consider Figure 11.1a,
which portrays the early stages of the simulation. The space of the figure represents the
range of possible values of the multivariate parameter, $\theta$, and each of the five jagged lines
represents the early path of a random walk starting near the center or the extremes of
the target distribution and jumping through the distribution according to an appropriate
sequence of random iterations. Figure 11.1b represents the mature stage of the same Markov
chain simulation, in which the simulated random walks have each traced a path throughout
the space of $\theta$, with a common stationary distribution that is equal to the target distribution.
We can then perform inferences about $\theta$ using points from the second halves of the Markov
chains we have simulated, as displayed in Figure 11.1c. 

In our applications of Markov chain simulation, we create several independent sequences; 
each sequence, $\theta^1, \theta^2, \theta^3, \ldots$, is produced by starting at some point $\theta^0$ and then, for each $t$,
drawing $\theta^t$ from a transition distribution, $T_t(\theta^t \mid \theta^{t-1})$, that depends on the previous draw,
$\theta^{t-1}$. As we shall see in the discussion of combining the Gibbs sampler and Metropolis
sampling in Section 11.3, it is often convenient to allow the transition distribution to depend
on the iteration number $t$; hence the notation $T_t$. The transition probability distributions
must be constructed so that the Markov chain converges to a unique stationary distribution
that is the posterior distribution, $p(\theta \mid y)$.

Markov chain simulation is used when it is not possible (or not computationally efficient) 
to sample $\theta$ directly from $p(\theta \mid y)$; instead we sample iteratively in such a way that at each
step of the process we expect to draw from a distribution that becomes closer to $p(\theta \mid y)$. For
a wide class of problems (including posterior distributions for many hierarchical models), 
this appears to be the easiest way to get reliable results. In addition, Markov chain and 
other iterative simulation methods have many applications outside Bayesian statistics, such 
as optimization, that we do not discuss here. 

The key to Markov chain simulation is to create a Markov process whose stationary distribution
is the specified $p(\theta \mid y)$ and to run the simulation long enough that the distribution
of the current draws is close enough to this stationary distribution. For any specific $p(\theta \mid y)$,
or unnormalized density $q(\theta \mid y)$, a variety of Markov chains with the desired property can
be constructed, as we demonstrate in Sections 11.1-11.3.

Figure 11.1 Five independent sequences of a Markov chain simulation for the bivariate unit normal
distribution, with overdispersed starting points indicated by solid squares. (a) After 50 iterations,
the sequences are still far from convergence. (b) After 1000 iterations, the sequences are nearer to
convergence. Figure (c) shows the iterates from the second halves of the sequences; these represent
a set of (correlated) draws from the target distribution. The points in Figure (c) have been jittered
so that steps in which the random walks stood still are not hidden. The simulation is a Metropolis
algorithm described in the example on page 278, with a jumping rule that has purposely been chosen
to be inefficient so that the chains will move slowly and their random-walk-like aspect will be
apparent.

Once the simulation algorithm has been implemented and the simulations drawn, it 
is absolutely necessary to check the convergence of the simulated sequences; for example, 
the simulations of Figure 11.1a are far from convergence and are not close to the target
distribution. We discuss how to check convergence in Section 11.4, and in Section 11.5 we 
construct an expression for the effective number of simulation draws for a correlated sample. 
If convergence is painfully slow, the algorithm should be altered, as discussed in the next 
chapter. 

This chapter introduces the basic Markov chain simulation methods—the Gibbs sampler 
and the Metropolis-Hastings algorithm—in the context of our general computing approach 
based on successive approximation. We sketch a proof of the convergence of Markov chain 
simulation algorithms and present a method for monitoring the convergence in practice. 
We illustrate these methods in Section 11.6 for a hierarchical normal model. For most of 
this chapter we consider simple and familiar (even trivial) examples in order to focus on 
the principles of iterative simulation methods as they are used for posterior simulation. 
Many examples of these methods appear in the recent statistical literature and also in the 
later parts of this book. Appendix C shows the details of implementation in the computer
languages R and Stan for the educational testing example from Chapter 5. 


11.1 Gibbs sampler 


A particular Markov chain algorithm that has been found useful in many multidimensional 
problems is the Gibbs sampler, also called alternating conditional sampling, which is defined 
in terms of subvectors of $\theta$. Suppose the parameter vector $\theta$ has been divided into $d$
components or subvectors, $\theta = (\theta_1, \ldots, \theta_d)$. Each iteration of the Gibbs sampler cycles
through the subvectors of $\theta$, drawing each subset conditional on the value of all the others.
There are thus $d$ steps in iteration $t$. At each iteration $t$, an ordering of the $d$ subvectors of
$\theta$ is chosen and, in turn, each $\theta_j^t$ is sampled from the conditional distribution given all the
other components of $\theta$:

$$p(\theta_j \mid \theta_{-j}^{t-1}, y),$$

where $\theta_{-j}^{t-1}$ represents all the components of $\theta$, except for $\theta_j$, at their current values:

$$\theta_{-j}^{t-1} = (\theta_1^t, \ldots, \theta_{j-1}^t, \theta_{j+1}^{t-1}, \ldots, \theta_d^{t-1}).$$

Thus, each subvector $\theta_j$ is updated conditional on the latest values of the other components
of $\theta$, which are the iteration $t$ values for the components already updated and the iteration
$t - 1$ values for the others.

Figure 11.2 Four independent sequences of the Gibbs sampler for a bivariate normal distribution
with correlation $\rho = 0.8$, with overdispersed starting points indicated by solid squares. (a) First 10
iterations, showing the componentwise updating of the Gibbs iterations. (b) After 500 iterations,
the sequences have reached approximate convergence. Figure (c) shows the points from the second
halves of the sequences, representing a set of correlated draws from the target distribution.

For many problems involving standard statistical models, it is possible to sample di- 
rectly from most or all of the conditional posterior distributions of the parameters. We 
typically construct models using a sequence of conditional probability distributions, as in 
the hierarchical models of Chapter 5. It is often the case that the conditional distributions 
in such models are conjugate distributions that provide for easy simulation. We present an 
example for the hierarchical normal model at the end of this chapter and another detailed 
example for a normal-mixture model in Section 22.2. Here, we illustrate the workings of 
the Gibbs sampler with a simple example. 

Example. Bivariate normal distribution 

Consider a single observation $(y_1, y_2)$ from a bivariate normally distributed population
with unknown mean $\theta = (\theta_1, \theta_2)$ and known covariance matrix
$\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$. With a uniform
prior distribution on $\theta$, the posterior distribution is

$$\begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix} \Bigg|\, y \sim \mathrm{N}\left( \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right).$$

Although it is simple to draw directly from the joint posterior distribution of $(\theta_1, \theta_2)$,
for the purpose of exposition we demonstrate the Gibbs sampler here. We need the
conditional posterior distributions, which, from the properties of the multivariate normal
distribution (either equation (A.1) or (A.2) on page 582), are

$$\begin{aligned}
\theta_1 \mid \theta_2, y &\sim \mathrm{N}(y_1 + \rho(\theta_2 - y_2),\; 1 - \rho^2) \\
\theta_2 \mid \theta_1, y &\sim \mathrm{N}(y_2 + \rho(\theta_1 - y_1),\; 1 - \rho^2).
\end{aligned}$$

The Gibbs sampler proceeds by alternately sampling from these two normal distributions.
In general, we would say that a natural way to start the iterations would be
with random draws from a normal approximation to the posterior distribution; such
draws would eliminate the need for iterative simulation in this trivial example. Figure
11.2 illustrates for the case $\rho = 0.8$, data $(y_1, y_2) = (0, 0)$, and four independent
sequences started at $(\pm 2.5, \pm 2.5)$.
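
The alternating updates of this example can be sketched in a few lines. This is a minimal illustration rather than the code used to produce Figure 11.2: the function name, the seed, and the summary statistics at the end are our own choices, while the correlation, data, starting points, and 500 iterations follow the example.

```python
import math
import random


def gibbs_bivariate_normal(y1, y2, rho, start, n_iter, rng):
    """Alternately draw theta1 | theta2, y and theta2 | theta1, y."""
    theta1, theta2 = start
    cond_sd = math.sqrt(1.0 - rho**2)   # conditional sd is the same for both
    path = []
    for _ in range(n_iter):
        theta1 = rng.gauss(y1 + rho * (theta2 - y2), cond_sd)
        theta2 = rng.gauss(y2 + rho * (theta1 - y1), cond_sd)  # uses new theta1
        path.append((theta1, theta2))
    return path


rng = random.Random(42)
starts = [(2.5, 2.5), (2.5, -2.5), (-2.5, 2.5), (-2.5, -2.5)]
chains = [gibbs_bivariate_normal(0.0, 0.0, 0.8, s, 500, rng) for s in starts]

# Keep the second half of each sequence and pool the draws.
kept = [draw for chain in chains for draw in chain[250:]]
n = len(kept)
mean1 = sum(a for a, _ in kept) / n
mean2 = sum(b for _, b in kept) / n
cov = sum((a - mean1) * (b - mean2) for a, b in kept) / n
var1 = sum((a - mean1) ** 2 for a, _ in kept) / n
var2 = sum((b - mean2) ** 2 for _, b in kept) / n
corr = cov / math.sqrt(var1 * var2)
print(mean1, mean2, corr)
```

The pooled draws should have means near $(y_1, y_2) = (0, 0)$ and a correlation near $\rho = 0.8$, up to simulation variability.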


11.2 Metropolis and Metropolis-Hastings algorithms 


The Metropolis-Hastings algorithm is a general term for a family of Markov chain simula- 
tion methods that are useful for sampling from Bayesian posterior distributions. We have 
already seen the Gibbs sampler in the previous section; it can be viewed as a special case 
of Metropolis-Hastings (as described in Section 11.3). Here we present the basic Metropolis 
algorithm and its generalization to the Metropolis-Hastings algorithm. 


The Metropolis algorithm 


The Metropolis algorithm is an adaptation of a random walk with an acceptance/rejection 
rule to converge to the specified target distribution. The algorithm proceeds as follows. 


1. Draw a starting point $\theta^0$, for which $p(\theta^0 \mid y) > 0$, from a starting distribution $p_0(\theta)$. The
starting distribution might, for example, be based on an approximation as described
in Section 13.3. Or we may simply choose starting values dispersed around a crude
approximate estimate of the sort discussed in Chapter 10.

2. For $t = 1, 2, \ldots$:

(a) Sample a proposal $\theta^*$ from a jumping distribution (or proposal distribution) at time
$t$, $J_t(\theta^* \mid \theta^{t-1})$. For the Metropolis algorithm (but not the Metropolis-Hastings algorithm,
as discussed later in this section), the jumping distribution must be symmetric,
satisfying the condition $J_t(\theta_a \mid \theta_b) = J_t(\theta_b \mid \theta_a)$ for all $\theta_a$, $\theta_b$, and $t$.

(b) Calculate the ratio of the densities,

$$r = \frac{p(\theta^* \mid y)}{p(\theta^{t-1} \mid y)}. \tag{11.1}$$

(c) Set

$$\theta^t = \begin{cases} \theta^* & \text{with probability } \min(r, 1) \\ \theta^{t-1} & \text{otherwise.} \end{cases}$$


Given the current value $\theta^{t-1}$, the transition distribution $T_t(\theta^t|\theta^{t-1})$ of the Markov chain
is thus a mixture of a point mass at $\theta^t = \theta^{t-1}$, and a weighted version of the jumping
distribution, $J_t(\theta^t|\theta^{t-1})$, that adjusts for the acceptance rate.

The algorithm requires the ability to calculate the ratio $r$ in (11.1) for all $(\theta, \theta^*)$, and to
draw $\theta^*$ from the jumping distribution $J_t(\theta^*|\theta)$ for all $\theta$ and $t$. In addition, step (c) above
requires the generation of a uniform random number.

When $\theta^t = \theta^{t-1}$, that is, if the jump is not accepted, this still counts as an iteration
in the algorithm.


Example. Bivariate unit normal density with normal jumping kernel 

For simplicity, we illustrate the Metropolis algorithm with the bivariate unit normal
distribution. The target density is
$p(\theta|y) = \mathrm{N}(\theta|0, I)$, where $I$ is the $2 \times 2$ identity matrix. The jumping distribution
is also bivariate normal, centered at the current iteration and scaled to 1/5 the size:
$J_t(\theta^*|\theta^{t-1}) = \mathrm{N}(\theta^*|\theta^{t-1}, 0.2^2 I)$. At each step, it is easy to calculate the density
ratio $r = \mathrm{N}(\theta^*|0, I)/\mathrm{N}(\theta^{t-1}|0, I)$. It is clear from the form of the normal distribution
that the jumping rule is symmetric. Figure 11.1 on page 276 displays five simulation
runs starting from different points. We have purposely set the scale of this jumping
algorithm to be too small, relative to the target distribution, so that the algorithm
will run inefficiently and its random-walk aspect will be obvious in the figure. In
Section 12.2 we discuss how to set the jumping scale to optimize the efficiency of the
Metropolis algorithm.
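A rough sketch of this example in Python may help make the steps concrete. It is only an illustration under stated assumptions: the function names, the starting point (2.5, 2.5), the seed, and the run length of 2000 iterations are not from the text, and working with log densities is an implementation choice made here for numerical stability.

```python
import math
import random


def log_target(theta):
    # log N(theta | 0, I), up to an additive constant
    return -0.5 * (theta[0] ** 2 + theta[1] ** 2)


def metropolis(start, n_iter, scale, rng):
    """Random-walk Metropolis with a symmetric normal jumping rule."""
    theta = tuple(start)
    draws, n_accept = [], 0
    for _ in range(n_iter):
        # (a) propose from N(theta, scale^2 I)
        proposal = (rng.gauss(theta[0], scale), rng.gauss(theta[1], scale))
        # (b) density ratio r of (11.1), on the log scale
        log_r = log_target(proposal) - log_target(theta)
        # (c) accept with probability min(r, 1)
        if log_r >= 0.0 or rng.random() < math.exp(log_r):
            theta = proposal
            n_accept += 1
        draws.append(theta)  # a rejected jump still counts as an iteration
    return draws, n_accept / n_iter


draws, accept_rate = metropolis((2.5, 2.5), 2000, 0.2, random.Random(0))
```

With the deliberately small jumping scale of 0.2, most proposals are accepted but each one moves only a short distance, which is the inefficiency the example is meant to show.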


Relation to optimization 


The acceptance/rejection rule of the Metropolis algorithm can be stated as follows: (a) if
the jump increases the posterior density, set $\theta^t = \theta^*$; (b) if the jump decreases the posterior
density, set $\theta^t = \theta^*$ with probability equal to the density ratio, $r$, and set $\theta^t = \theta^{t-1}$
otherwise. The Metropolis algorithm can thus be viewed as a stochastic version of a stepwise
mode-finding algorithm, always accepting steps that increase the density but only sometimes
accepting downward steps.


Why does the Metropolis algorithm work? 


The proof that the sequence of iterations $\theta^1, \theta^2, \ldots$ converges to the target distribution has
two steps: first, it is shown that the simulated sequence is a Markov chain with a unique
stationary distribution, and second, it is shown that the stationary distribution equals this
target distribution. The first step of the proof holds if the Markov chain is irreducible,
aperiodic, and not transient. Except for trivial exceptions, the latter two conditions hold
for a random walk on any proper distribution, and irreducibility holds as long as the random
walk has a positive probability of eventually reaching any state from any other state; that
is, the jumping distributions $J_t$ must eventually be able to jump to all states with positive
probability.

To see that the target distribution is the stationary distribution of the Markov chain
generated by the Metropolis algorithm, consider starting the algorithm at time $t - 1$ with
a draw $\theta^{t-1}$ from the target distribution $p(\theta|y)$. Now consider any two such points $\theta_a$ and
$\theta_b$, drawn from $p(\theta|y)$ and labeled so that $p(\theta_b|y) \geq p(\theta_a|y)$. The unconditional probability
density of a transition from $\theta_a$ to $\theta_b$ is

$$p(\theta^{t-1} = \theta_a,\ \theta^t = \theta_b) = p(\theta_a|y)\, J_t(\theta_b|\theta_a),$$

where the acceptance probability is 1 because of our labeling of $a$ and $b$, and the unconditional
probability density of a transition from $\theta_b$ to $\theta_a$ is, from (11.1),

$$p(\theta^t = \theta_a,\ \theta^{t-1} = \theta_b) = p(\theta_b|y)\, J_t(\theta_a|\theta_b) \left( \frac{p(\theta_a|y)}{p(\theta_b|y)} \right) = p(\theta_a|y)\, J_t(\theta_a|\theta_b),$$

which is the same as the probability of a transition from $\theta_a$ to $\theta_b$, since we have required
that $J_t(\cdot|\cdot)$ be symmetric. Since their joint distribution is symmetric, $\theta^t$ and $\theta^{t-1}$ have
the same marginal distributions, and so $p(\theta|y)$ is the stationary distribution of the Markov
chain of $\theta$. For more detailed theoretical concerns, see the bibliographic note at the end of
this chapter.
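One way to see the stationarity argument at work is to check it numerically on a small discrete target, where the Metropolis transition probabilities can be written out exactly. The five-state target, the ring-shaped symmetric jumping rule, and all names below are assumptions chosen for illustration; they do not come from the text.

```python
def metropolis_transition_matrix(weights):
    """Transition matrix of a Metropolis chain on states 0..K-1.

    Jumping rule: step to the left or right neighbor on a ring, each with
    probability 1/2 (symmetric). `weights` are unnormalized target masses.
    """
    k = len(weights)
    T = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in ((a - 1) % k, (a + 1) % k):
            r = weights[b] / weights[a]          # density ratio as in (11.1)
            T[a][b] += 0.5 * min(r, 1.0)         # propose, then accept
        # Rejected proposals leave the chain where it is.
        T[a][a] = 1.0 - sum(T[a][b] for b in range(k) if b != a)
    return T


weights = [1.0, 4.0, 2.0, 0.5, 3.0]  # hypothetical unnormalized target
k = len(weights)
target = [w / sum(weights) for w in weights]
T = metropolis_transition_matrix(weights)

# Joint probability of (a -> b) versus (b -> a) when starting from the target.
detailed_balance_gap = max(
    abs(target[a] * T[a][b] - target[b] * T[b][a])
    for a in range(k) for b in range(k)
)
# Distribution after one step when the chain starts from the target.
after_one_step = [sum(target[a] * T[a][b] for a in range(k)) for b in range(k)]
stationarity_gap = max(abs(after_one_step[b] - target[b]) for b in range(k))
```

Both gaps should be zero up to floating-point rounding: the symmetric joint distribution of successive states implies that one step of the chain leaves the target distribution unchanged.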


The Metropolis-Hastings algorithm 


The Metropolis-Hastings algorithm generalizes the basic Metropolis algorithm presented
above in two ways. First, the jumping rules $J_t$ need no longer be symmetric; that is, there
is no requirement that $J_t(\theta_a|\theta_b) = J_t(\theta_b|\theta_a)$. Second, to correct for the asymmetry in the
jumping rule, the ratio $r$ in (11.1) is replaced by a ratio of ratios:

$$r = \frac{p(\theta^*|y) / J_t(\theta^*|\theta^{t-1})}{p(\theta^{t-1}|y) / J_t(\theta^{t-1}|\theta^*)}. \tag{11.2}$$

(The ratio $r$ is always defined, because a jump from $\theta^{t-1}$ to $\theta^*$ can only occur if both
$p(\theta^{t-1}|y)$ and $J_t(\theta^*|\theta^{t-1})$ are nonzero.)

Allowing asymmetric jumping rules can be useful in increasing the speed of the random
walk. Convergence to the target distribution is proved in the same way as for the Metropolis
algorithm. The proof of convergence to a unique stationary distribution is identical. To
prove that the stationary distribution is the target distribution, $p(\theta|y)$, consider any two
points $\theta_a$ and $\theta_b$ with posterior densities labeled so that $p(\theta_b|y) J_t(\theta_a|\theta_b) \geq p(\theta_a|y) J_t(\theta_b|\theta_a)$.
If $\theta^{t-1}$ follows the target distribution, then it is easy to show that the unconditional
probability density of a transition from $\theta_a$ to $\theta_b$ is the same as the reverse transition.
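As an illustration of the ratio of ratios in (11.2), the following sketch applies Metropolis-Hastings with an asymmetric jumping rule. Everything specific here is an assumption made for the example: the Gamma(3, 1) target, the lognormal (multiplicative random-walk) jumping rule, the function names, the seed, and the run length.

```python
import math
import random


def log_target(theta):
    # Unnormalized log density of a Gamma(3, 1) target on theta > 0.
    return 2.0 * math.log(theta) - theta


def log_jump(to, frm, scale):
    # Log density of the lognormal jumping rule J(to | frm):
    # log(to) is normal with mean log(frm) and standard deviation `scale`.
    z = (math.log(to) - math.log(frm)) / scale
    return -math.log(to * scale * math.sqrt(2.0 * math.pi)) - 0.5 * z * z


def metropolis_hastings(start, n_iter, scale, rng):
    theta = start
    draws = []
    for _ in range(n_iter):
        proposal = theta * math.exp(scale * rng.gauss(0.0, 1.0))
        # Ratio of ratios (11.2), computed on the log scale.
        log_r = (log_target(proposal) - log_jump(proposal, theta, scale)) - (
            log_target(theta) - log_jump(theta, proposal, scale)
        )
        if log_r >= 0.0 or rng.random() < math.exp(log_r):
            theta = proposal
        draws.append(theta)
    return draws


rng = random.Random(0)
draws = metropolis_hastings(start=1.0, n_iter=20000, scale=1.0, rng=rng)
kept = draws[len(draws) // 2:]           # discard the first half as warm-up
posterior_mean = sum(kept) / len(kept)   # Gamma(3, 1) has mean 3
```

Because the lognormal jump is not symmetric, dropping the two `log_jump` terms would target the wrong distribution; keeping them is what the correction in (11.2) amounts to.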


Relation between the jumping rule and efficiency of simulations 


The ideal Metropolis-Hastings jumping rule is simply to sample the proposal, $\theta^*$, from
the target distribution; that is, $J(\theta^*|\theta) \equiv p(\theta^*|y)$ for all $\theta$. Then the ratio $r$ in (11.2) is
always exactly 1, and the iterates $\theta^t$ are a sequence of independent draws from $p(\theta|y)$. In
general, however, iterative simulation is applied to problems for which direct sampling is
not possible.

A good jumping distribution has the following properties:

• For any $\theta$, it is easy to sample from $J(\theta^*|\theta)$.

• It is easy to compute the ratio $r$.

• Each jump goes a reasonable distance in the parameter space (otherwise the random
walk moves too slowly).

• The jumps are not rejected too frequently (otherwise the random walk wastes too much
time standing still).


We return to the topic of constructing efficient simulation algorithms in the next chapter. 


11.3 Using Gibbs and Metropolis as building blocks 


The Gibbs sampler and the Metropolis algorithm can be used in various combinations to 
sample from complicated distributions. The Gibbs sampler is the simplest of the Markov 
chain simulation algorithms, and it is our first choice for conditionally conjugate models, 
where we can directly sample from each conditional posterior distribution. For example, we 
could use the Gibbs sampler for the normal-normal hierarchical models in Chapter 5. 

The Metropolis algorithm can be used for models that are not conditionally conjugate,
for example, the two-parameter logistic regression for the bioassay experiment in Section
3.7. In this example, the Metropolis algorithm could be performed in vector form, jumping
in the two-dimensional space of $(\alpha, \beta)$, or embedded within a Gibbs sampler structure, by
alternately updating $\alpha$ and $\beta$ using one-dimensional Metropolis jumps. In either case, the
Metropolis algorithm will probably have to be tuned to get a good acceptance rate, as
discussed in Section 12.2.

If some of the conditional posterior distributions in a model can be sampled from directly 
and some cannot, then the parameters can be updated one at a time, with the Gibbs 
sampler used where possible and one-dimensional Metropolis updating used otherwise. More 
generally, the parameters can be updated in blocks, where each block is altered using the 
Gibbs sampler or a Metropolis jump of the parameters within the block. 

A general problem with conditional sampling algorithms is that they can be slow when 
parameters are highly correlated in the target distribution (for example, see Figure 11.2 on 
page 277). This can be fixed in simple problems using reparameterization (see Section 12.1) 
or more generally using the more advanced algorithms mentioned in Chapter 12.


Interpretation of the Gibbs sampler as a special case of the Metropolis-Hastings algorithm


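Gibbs sampling can be viewed as a special case of the Metropolis-Hastings algorithm in
the following way. We first define iteration $t$ to consist of a series of $d$ steps, with step $j$
of iteration $t$ corresponding to an update of the subvector $\theta_j$ conditional on all the other
elements of $\theta$. Then the jumping distribution, $J_{j,t}(\cdot|\cdot)$, at step $j$ of iteration $t$ only jumps
along the $j$th subvector, and does so with the conditional posterior density of $\theta_j$ given $\theta_{-j}^{t-1}$:

$$J_{j,t}^{\mathrm{Gibbs}}(\theta^*|\theta^{t-1}) = \begin{cases} p(\theta_j^*|\theta_{-j}^{t-1}, y) & \text{if } \theta_{-j}^* = \theta_{-j}^{t-1} \\ 0 & \text{otherwise.} \end{cases}$$

The only possible jumps are to parameter vectors $\theta^*$ that match $\theta^{t-1}$ on all components
other than the $j$th. Under this jumping distribution, the ratio (11.2) at the $j$th step of
iteration $t$ is

$$\begin{aligned}
r &= \frac{p(\theta^*|y) / J_{j,t}^{\mathrm{Gibbs}}(\theta^*|\theta^{t-1})}{p(\theta^{t-1}|y) / J_{j,t}^{\mathrm{Gibbs}}(\theta^{t-1}|\theta^*)} \\
  &= \frac{p(\theta^*|y) / p(\theta_j^*|\theta_{-j}^{t-1}, y)}{p(\theta^{t-1}|y) / p(\theta_j^{t-1}|\theta_{-j}^{t-1}, y)} \\
  &= \frac{p(\theta_{-j}^{t-1}|y)}{p(\theta_{-j}^{t-1}|y)} \\
  &\equiv 1,
\end{aligned}$$

and thus every jump is accepted. The second line above follows from the first because,
under this jumping rule, $\theta^*$ differs from $\theta^{t-1}$ only in the $j$th component. The third line
follows from the second by applying the rules of conditional probability to $\theta = (\theta_j, \theta_{-j})$ and
noting that $\theta_{-j}^* = \theta_{-j}^{t-1}$.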

Usually, one iteration of the Gibbs sampler is defined as we do, to include all $d$ steps
corresponding to the $d$ components of $\theta$, thereby updating all of $\theta$ at each iteration. It is
possible, however, to define Gibbs sampling without the restriction that each component
be updated in each iteration, as long as each component is updated periodically.


Gibbs sampler with approximations 


For some problems, sampling from some, or all, of the conditional distributions $p(\theta_j|\theta_{-j}, y)$
is impossible, but one can construct approximations, which we label $g(\theta_j|\theta_{-j})$, from which
sampling is possible. The general form of the Metropolis-Hastings algorithm can be used
to compensate for the approximation. As in the Gibbs sampler, we choose an order for
altering the $d$ elements of $\theta$; the jumping function at the $j$th Metropolis step at iteration $t$
is then

$$J_{j,t}(\theta^*|\theta^{t-1}) = \begin{cases} g(\theta_j^*|\theta_{-j}^{t-1}) & \text{if } \theta_{-j}^* = \theta_{-j}^{t-1} \\ 0 & \text{otherwise,} \end{cases}$$

and the ratio $r$ in (11.2) must be computed and the acceptance or rejection of $\theta^*$ decided.


11.4 Inference and assessing convergence 


The basic method of inference from iterative simulation is the same as for Bayesian simulation
in general: use the collection of all the simulated draws from $p(\theta|y)$ to summarize
the posterior density and to compute quantiles, moments, and other summaries of interest
as needed. Posterior predictive simulations of unobserved outcomes $\tilde{y}$ can be obtained by
simulation conditional on the drawn values of $\theta$. Inference using the iterative simulation
draws requires some care, however, as we discuss in this section.


Difficulties of inference from iterative simulation


Iterative simulation adds two challenges to simulation inference. First, if the iterations
have not proceeded long enough, as in Figure 11.1a, the simulations may be grossly
unrepresentative of the target distribution. Even when simulations have reached approximate
convergence, early iterations still reflect the starting approximation rather than the target
distribution; for example, consider the early iterations of Figures 11.1b and 11.2b.

The second problem with iterative simulation draws is their within-sequence correlation; 
aside from any convergence issues, simulation inference from correlated draws is generally 
less precise than from the same number of independent draws. Serial correlation in the 
simulations is not necessarily a problem because, at convergence, the draws are identically 
distributed as $p(\theta|y)$, and so when performing inferences, we ignore the order of the simulation
draws in any case. But such correlation can cause inefficiencies in simulations. Consider
Figure 11.1c, which displays 500 successive iterations from each of five simulated sequences 
of the Metropolis algorithm: the patchy appearance of the scatterplot would not be likely 
to appear from 2500 independent draws from the normal distribution but is rather a result 
of the slow movement of the simulation algorithm. In some sense, the ‘effective’ number 
of simulation draws here is far fewer than 2500. We calculate effective sample size using 
formula (11.8) on page 287. 

We handle the special problems of iterative simulation in three ways. First, we attempt 
to design the simulation runs to allow effective monitoring of convergence, in particular by 
simulating multiple sequences with starting points dispersed throughout parameter space, 
as in Figure 11.1a. Second, we monitor the convergence of all quantities of interest by
comparing variation between and within simulated sequences until ‘within’ variation roughly 
equals ‘between’ variation, as in Figure 11.1b. Only when the distribution of each simulated 
sequence is close to the distribution of all the sequences mixed together can they all be 
approximating the target distribution. Third, if the simulation efficiency is unacceptably 
low (in the sense of requiring too much real time on the computer to obtain approximate 
convergence of posterior inferences for quantities of interest), the algorithm can be altered, 
as we discuss in Sections 12.1 and 12.2. 


Discarding early iterations of the simulation runs 


To diminish the influence of the starting values, we generally discard the first half of each 
sequence and focus attention on the second half. Our inferences will be based on the 
assumption that the distributions of the simulated values $\theta^t$, for large enough $t$, are close
to the target distribution, $p(\theta|y)$. We refer to the practice of discarding early iterations in
Markov chain simulation as warm-up; depending on the context, different warm-up fractions
can be appropriate. For example, in the Gibbs sampler displayed in Figure 11.2, it would
be necessary to discard only a few initial iterations.¹ We adopt the general practice of
discarding the first half as a conservative choice. For example, we might run 200 iterations 
and discard the first half. If approximate convergence has not yet been reached, we might 
then run another 200 iterations, now discarding all of the initial 200 iterations. 


Dependence of the iterations in each sequence 


Another issue that sometimes arises, once approximate convergence has been reached, is
whether to thin the sequences by keeping every $k$th simulation draw from each sequence
and discarding the rest. In our applications, we have found it useful to skip iterations in
problems with large numbers of parameters where computer storage is a problem, perhaps
setting $k$ so that the total number of iterations saved is no more than 1000.

¹In the simulation literature (including earlier editions of this book), the warm-up period is called
burn-in, a term we now avoid because we feel it draws a misleading analogy to industrial processes in which
products are stressed in order to reveal defects. We prefer the term ‘warm-up’ to describe the early phase
of the simulations in which the sequences get closer to the mass of the distribution.


Figure 11.3 Examples of two challenges in assessing convergence of iterative simulations. (a) In
the left plot, either sequence alone looks stable, but the juxtaposition makes it clear that they have
not converged to a common distribution. (b) In the right plot, the two sequences happen to cover a
common distribution but neither sequence appears stationary. These graphs demonstrate the need
to use between-sequence and also within-sequence information when assessing convergence.

Whether or not the sequences are thinned, if the sequences have reached approximate
convergence, they can be directly used for inferences about the parameters $\theta$ and any other
quantities of interest.
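Thinning as described above amounts to a strided slice. The helper below is a small sketch, assuming the cap of 1000 saved draws is applied to a single sequence; the function name `thin` and the argument `max_saved` are hypothetical.

```python
import math


def thin(chain, max_saved=1000):
    """Keep every k-th draw, with k chosen so at most `max_saved` remain."""
    k = max(1, math.ceil(len(chain) / max_saved))
    # Start at index k - 1 so that the k-th, 2k-th, ... draws are kept.
    return chain[k - 1::k], k


saved, k = thin(list(range(5000)))
```

When the sequence is already short enough, `k` is 1 and nothing is discarded.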


Multiple sequences with overdispersed starting points 


Our recommended approach to assessing convergence of iterative simulation is based on 
comparing different simulated sequences, as illustrated in Figure 11.1 on page 276, which 
shows five parallel simulations before and after approximate convergence. In Figure 11.1a,
the multiple sequences clearly have not converged; the variance within each sequence is 
much less than the variance between sequences. Later, in Figure 11.1b, the sequences have 
mixed, and the two variance components are essentially equal. 

To see such disparities, we clearly need more than one independent sequence. Thus our 
plan is to simulate independently at least two sequences, with starting points drawn from 
an overdispersed distribution (either from a crude estimate such as discussed in Section 10.2 
or a more elaborate approximation as discussed in the next chapter). 


Monitoring scalar estimands 


We monitor each scalar estimand or other scalar quantities of interest separately. Estimands 
include all the parameters of interest in the model and any other quantities of interest (for 
example, the ratio of two parameters or the value of a predicted future observation). It is 
often useful also to monitor the value of the logarithm of the posterior density, which has 
probably already been computed if we are using a version of the Metropolis algorithm. 


Challenges of monitoring convergence: mixing and stationarity 


Figure 11.3 illustrates two of the challenges of monitoring convergence of iterative simulations.
The first graph shows two sequences, each of which looks fine on its own (and,
indeed, when looked at separately would satisfy any reasonable convergence criterion), but
when looked at together reveal a clear lack of convergence. Figure 11.3a illustrates that, to
achieve convergence, the sequences must together have mixed.

The second graph in Figure 11.3 shows two chains that have mixed, in the sense that 
they have traced out a common distribution, but they do not appear to have converged. 
Figure 11.3b illustrates that, to achieve convergence, each individual sequence must reach 
stationarity. 


Splitting each saved sequence into two parts 


We diagnose convergence (as noted above, separately for each scalar quantity of interest) 
by checking mixing and stationarity. There are various ways to do this; we apply a fairly 
simple approach in which we split each chain in half and check that all the resulting half- 
sequences have mixed. This simultaneously tests mixing (if all the chains have mixed well, 
the separate parts of the different chains should also mix) and stationarity (at stationarity, 
the first and second half of each sequence should be traversing the same distribution). 

We start with some number of simulated sequences in which the warm-up period (which
by default we set to the first half of the simulations) has already been discarded. We then
take each of these chains and split it into its first and second halves (this is all after discarding
the warm-up iterations). Let $m$ be the number of chains (after splitting) and $n$ be the
length of each chain. We always simulate at least two sequences so that we can observe
mixing (see Figure 11.3a); thus $m$ is always at least 4.

For example, suppose we simulate 5 chains, each of length 1000, and then discard the 
first half of each as warm-up. We are then left with 5 chains, each of length 500, and we split 
each into two parts: iterations 1-250 (originally iterations 501-750) and iterations 251-500 
(originally iterations 751-1000). We now have m = 10 chains, each of length n = 250. 
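The bookkeeping in this example can be sketched as follows. The function name is hypothetical, and the chains hold iteration numbers rather than simulated values so that the result of the split is easy to read off.

```python
def discard_warmup_and_split(chains):
    """Drop the first half of each chain, then split the remainder in two."""
    split = []
    for chain in chains:
        kept = chain[len(chain) // 2:]      # discard the warm-up (first half)
        half = len(kept) // 2
        split.append(kept[:half])
        split.append(kept[half:2 * half])   # drops one draw if the length is odd
    return split


# Five chains of length 1000, each storing its own iteration numbers 1..1000.
chains = [list(range(1, 1001)) for _ in range(5)]
halves = discard_warmup_and_split(chains)
m, n = len(halves), len(halves[0])
```

Here `m` and `n` play the roles of the number of split chains and their common length in the formulas that follow.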


Assessing mixing using between- and within-sequence variances 


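For each scalar estimand $\psi$, we label the simulations as $\psi_{ij}$ $(i = 1, \ldots, n;\ j = 1, \ldots, m)$, and
we compute $B$ and $W$, the between- and within-sequence variances:

$$B = \frac{n}{m-1} \sum_{j=1}^{m} (\bar{\psi}_{.j} - \bar{\psi}_{..})^2, \quad \text{where } \bar{\psi}_{.j} = \frac{1}{n} \sum_{i=1}^{n} \psi_{ij}, \quad \bar{\psi}_{..} = \frac{1}{m} \sum_{j=1}^{m} \bar{\psi}_{.j}$$

$$W = \frac{1}{m} \sum_{j=1}^{m} s_j^2, \quad \text{where } s_j^2 = \frac{1}{n-1} \sum_{i=1}^{n} (\psi_{ij} - \bar{\psi}_{.j})^2.$$

The between-sequence variance, $B$, contains a factor of $n$ because it is based on the variance
of the within-sequence means, $\bar{\psi}_{.j}$, each of which is an average of $n$ values $\psi_{ij}$.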

We can estimate $\mathrm{var}(\psi|y)$, the marginal posterior variance of the estimand, by a weighted
average of $W$ and $B$, namely

$$\widehat{\mathrm{var}}^{+}(\psi|y) = \frac{n-1}{n} W + \frac{1}{n} B. \tag{11.3}$$

This quantity overestimates the marginal posterior variance assuming the starting distribution
is appropriately overdispersed, but is unbiased under stationarity (that is, if the
starting distribution equals the target distribution), or in the limit $n \to \infty$ (see Exercise
11.5). This is analogous to the classical variance estimate with cluster sampling.

Meanwhile, for any finite $n$, the ‘within’ variance $W$ should be an underestimate of
$\mathrm{var}(\psi|y)$ because the individual sequences have not had time to range over all of the target
distribution and, as a result, will have less variability; in the limit as $n \to \infty$, the expectation
of $W$ approaches $\mathrm{var}(\psi|y)$.


Number of     95% intervals and $\hat{R}$ for ...
iterations    $\theta_1$                     $\theta_2$                     $\log p(\theta_1, \theta_2|y)$
50            [-2.14, 3.74], 12.3     [-1.83, 2.70], 6.1      [-8.71, -0.17], 6.1
500           [-3.17, 1.74], 1.3      [-2.17, 2.09], 1.7      [-5.23, -0.07], 1.3
2000          [-1.83, 2.24], 1.2      [-1.74, 2.09], 1.03     [-4.07, -0.03], 1.10
5000          [-2.09, 1.98], 1.02     [-1.90, 1.95], 1.03     [-3.70, -0.03], 1.00
$\infty$      [-1.96, 1.96], 1        [-1.96, 1.96], 1        [-3.69, -0.03], 1

Table 11.1 95% central intervals and estimated potential scale reduction factors for three scalar
summaries of the bivariate normal distribution simulated using a Metropolis algorithm. (For
demonstration purposes, the jumping scale of the Metropolis algorithm was purposely set to be inefficient;
see Figure 11.1.) Displayed are inferences from the second halves of five parallel sequences, stopping
after 50, 500, 2000, and 5000 iterations. The intervals for $\infty$ are taken from the known normal
and $-\chi^2_2/2$ marginal distributions for these summaries in the target distribution.

We monitor convergence of the iterative simulation by estimating the factor by which
the scale of the current distribution for $\psi$ might be reduced if the simulations were continued
in the limit $n \to \infty$. This potential scale reduction is estimated by²

$$\hat{R} = \sqrt{\frac{\widehat{\mathrm{var}}^{+}(\psi|y)}{W}}, \tag{11.4}$$

which declines to 1 as $n \to \infty$. If the potential scale reduction is high, then we have reason
to believe that proceeding with further simulations may improve our inference about the
target distribution of the associated scalar estimand.
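The between- and within-sequence variances, the variance estimate (11.3), and the potential scale reduction (11.4) translate directly into code. The sketch below assumes the input is a list of already split chains of equal length for one scalar estimand; the function names and the tiny two-chain input are made up so that the arithmetic can be checked by hand.

```python
import math


def between_within(chains):
    """Return (B, W) for m split chains of common length n."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand_mean = sum(means) / m
    B = n / (m - 1) * sum((mj - grand_mean) ** 2 for mj in means)
    s2 = [sum((x - mj) ** 2 for x in c) / (n - 1) for c, mj in zip(chains, means)]
    W = sum(s2) / m
    return B, W


def var_plus(chains):
    """Weighted average of W and B, as in (11.3)."""
    n = len(chains[0])
    B, W = between_within(chains)
    return (n - 1) / n * W + B / n


def r_hat(chains):
    """Potential scale reduction factor, as in (11.4)."""
    _, W = between_within(chains)
    return math.sqrt(var_plus(chains) / W)


toy = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]   # hypothetical draws, m = 2, n = 3
B, W = between_within(toy)                 # B = 1.5, W = 1.0
R = r_hat(toy)                             # sqrt(7/6)
```

For the toy input the chain means are 2 and 3, so $B = 3 \times 0.5 = 1.5$, $W = 1$, and the variance estimate is $2/3 + 1.5/3 = 7/6$.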


Example. Bivariate unit normal density with bivariate normal jumping 
kernel (continued) 

We illustrate the multiple sequence method using the Metropolis simulations of the 
bivariate normal distribution illustrated in Figure 11.1. Table 11.1 displays posterior 
inference for the two parameters of the distribution as well as the log posterior density 
(relative to the density at the mode). After 50 iterations, the variance between the five 
sequences is much greater than the variance within, for all three univariate summaries 
considered. However, the five simulated sequences have converged adequately after 
2000 or certainly 5000 iterations for the quantities of interest. The comparison with the 
true target distribution shows how some variability remains in the posterior inferences 
even after the Markov chains have converged. (This must be so, considering that even 
if the simulation draws were independent, so that the Markov chains would converge 
in a single iteration, it would still require hundreds or thousands of draws to obtain 
precise estimates of extreme posterior quantiles.) 


The method of monitoring convergence presented here has the key advantage of not 
requiring the user to examine time series graphs of simulated sequences. Inspection of 
such plots is a notoriously unreliable method of assessing convergence and in addition is 
unwieldy when monitoring a large number of quantities of interest, such as can arise in 
complicated hierarchical models. Because it is based on means and variances, the simple
method presented here is most effective for quantities whose marginal posterior distributions
are approximately normal. When performing inference for extreme quantiles, or for parameters
with multimodal marginal posterior distributions, one should monitor also extreme
quantiles of the ‘between’ and ‘within’ sequences.

²In the first edition of this book, $\hat{R}$ was defined as $\widehat{\mathrm{var}}^{+}(\psi|y)/W$. We have switched to the square-root
definition for notational convenience. We have also made one major change since the second edition of the
book. Our current $\hat{R}$ has the same formula as before, but we now compute it on the split chains, whereas
previously we applied it to the entire chains unsplit. The unsplit $\hat{R}$ from the earlier editions of this book
would not correctly diagnose the poor convergence in Figure 11.3b.


11.5 Effective number of simulation draws 


Once the simulated sequences have mixed, we can compute an approximate ‘effective number
of independent simulation draws’ for any estimand of interest $\psi$. We start with the observation
that if the $n$ simulation draws within each sequence were truly independent, then
the between-sequence variance $B$ would be an unbiased estimate of the posterior variance,
$\mathrm{var}(\psi|y)$, and we would have a total of $mn$ independent simulations from the $m$ sequences.
In general, however, the simulations of $\psi$ within each sequence will be autocorrelated, and
$B$ will be larger than $\mathrm{var}(\psi|y)$, in expectation.

One way to define effective sample size for correlated simulation draws is to consider the
statistical efficiency of the average of the simulations $\bar{\psi}_{..}$, as an estimate of the posterior
mean, $\mathrm{E}(\psi|y)$. This can be a reasonable baseline even though it is not the only possible
summary and might be inappropriate, for example, if there is particular interest in accurate
representation of low-probability events in the tails of the distribution.

Continuing with this definition, it is usual to compute effective sample size using the
following asymptotic formula for the variance of the average of a correlated sequence:

$$\lim_{n \to \infty} mn\, \mathrm{var}(\bar{\psi}_{..}) = \left( 1 + 2 \sum_{t=1}^{\infty} \rho_t \right) \mathrm{var}(\psi|y), \tag{11.5}$$

where $\rho_t$ is the autocorrelation of the sequence $\psi$ at lag $t$. If the $n$ simulation draws from
each of the $m$ chains were independent, then $\mathrm{var}(\bar{\psi}_{..})$ would simply be $\frac{1}{mn} \mathrm{var}(\psi|y)$ and the
sample size would be $mn$. In the presence of correlation we then define the effective sample
size as

$$n_{\mathrm{eff}} = \frac{mn}{1 + 2 \sum_{t=1}^{\infty} \rho_t}. \tag{11.6}$$

The asymptotic nature of (11.5) and (11.6) might seem disturbing given that in reality we will
only have a finite simulation, but this should not be a problem given that we already want
to run the simulations long enough for approximate convergence to the (asymptotic) target
distribution.

To compute the effective sample size we need an estimate of the sum of the correlations
$\rho$, for which we use information within and between sequences. We start by computing the
total variance using the estimate $\widehat{\mathrm{var}}^{+}$ from (11.3); we then estimate the correlations by
first computing the variogram $V_t$ at each lag $t$:

$$V_t = \frac{1}{m(n-t)} \sum_{j=1}^{m} \sum_{i=t+1}^{n} (\psi_{i,j} - \psi_{i-t,j})^2.$$


We then estimate the correlations by inverting the formula, $\mathrm{E}(\psi_i - \psi_{i-t})^2 = 2(1 - \rho_t)\mathrm{var}(\psi)$:

$$\hat{\rho}_t = 1 - \frac{V_t}{2\, \widehat{\mathrm{var}}^{+}}. \tag{11.7}$$


Unfortunately we cannot simply sum all of these to estimate $n_{\mathrm{eff}}$ in (11.6); the difficulty is
that for large values of $t$ the sample correlation is too noisy. Instead we compute a partial
sum, starting from lag 0 and continuing until the sum of autocorrelation estimates for two
successive lags $\hat{\rho}_{2t'} + \hat{\rho}_{2t'+1}$ is negative. We use this positive partial sum as our estimate of
$\sum_{t=1}^{\infty} \rho_t$ in (11.6). Putting this all together yields the estimate,

$$\hat{n}_{\mathrm{eff}} = \frac{mn}{1 + 2 \sum_{t=1}^{T} \hat{\rho}_t}, \tag{11.8}$$

where the estimated autocorrelations $\hat{\rho}_t$ are computed from formula (11.7) and $T$ is the first
odd positive integer for which $\hat{\rho}_{T+1} + \hat{\rho}_{T+2}$ is negative.

All these calculations should be performed using only the saved iterations, after discarding
the warm-up period. For example, suppose we simulate 4 chains, each of length 1000,
and then discard the first half of each as warm-up. Then $m = 8$, $n = 250$, and we compute
variograms and correlations only for the saved iterations (thus, up to a maximum lag $t$ of
249, although in practice the stopping point $T$ in (11.8) will be much lower).
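The variogram, the autocorrelation estimate (11.7), and the truncated sum in (11.8) can be put together as in the following sketch. It assumes split chains of equal length for a single estimand and does not guard against degenerate inputs such as constant chains; the function names, the seed, and the simulated independent draws used as input are assumptions made for illustration.

```python
import random


def var_plus(chains):
    """Estimate (11.3) of the marginal posterior variance from split chains."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    between = n / (m - 1) * sum((mj - grand) ** 2 for mj in means)
    within = sum(
        sum((x - mj) ** 2 for x in c) / (n - 1) for c, mj in zip(chains, means)
    ) / m
    return (n - 1) / n * within + between / n


def variogram(chains, t):
    """V_t: mean squared difference between draws t iterations apart."""
    m, n = len(chains), len(chains[0])
    total = sum((c[i] - c[i - t]) ** 2 for c in chains for i in range(t, n))
    return total / (m * (n - t))


def effective_sample_size(chains):
    """Estimate (11.8), truncating the autocorrelation sum at lag T."""
    m, n = len(chains), len(chains[0])
    vp = var_plus(chains)
    cache = {}

    def rho_hat(t):
        # Autocorrelation estimate (11.7), computed once per lag.
        if t not in cache:
            cache[t] = 1.0 - variogram(chains, t) / (2.0 * vp)
        return cache[t]

    T = 1
    # Advance T two lags at a time while the next pair of estimated
    # autocorrelations has a nonnegative sum and those lags exist.
    while T + 2 <= n - 1 and rho_hat(T + 1) + rho_hat(T + 2) >= 0.0:
        T += 2
    rho_sum = sum(rho_hat(t) for t in range(1, T + 1))
    return m * n / (1.0 + 2.0 * rho_sum)


rng = random.Random(1)
iid_chains = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(4)]
ess_iid = effective_sample_size(iid_chains)
```

For independent draws such as `iid_chains`, the estimate should land in the neighborhood of $mn = 2000$; for strongly autocorrelated chains it will be much smaller.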


Bounded or long-tailed distributions 


The above convergence diagnostics are based on means and variances, and they will not
work so well for parameters or scalar summaries $\psi$ for which the posterior distribution, $p(\psi|y)$,
is far from Gaussian. (As discussed in Chapter 4, asymptotically the posterior distribution
should typically be normally distributed as the data sample size approaches infinity, but
(a) we are never actually at the asymptotic limit (in fact we are often interested in learning
from small samples), and (b) it is common to have only a small amount of data on individual
parameters that are part of a hierarchical model.)

For summaries $\psi$ whose distributions are constrained or otherwise far from normal,
we can preprocess simulations using transformations before computing the potential scale
reduction factor $\hat{R}$ and the effective sample size $\hat{n}_{\mathrm{eff}}$. We can take the logarithm of
all-positive quantities, the logit of quantities that are constrained to fall in (0, 1), and use
the rank transformation for long-tailed distributions. Transforming the simulations to have
well-behaved distributions should allow mean and variance-based convergence diagnostics
to work better.
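The three transformations mentioned above might be applied to the simulations as follows before computing the diagnostics. This is a sketch only: the function names are hypothetical, and ranking the draws over all chains pooled together, with tied values sharing their average rank, is an assumption about how the rank transformation would be carried out.

```python
import math


def log_transform(draws):
    """For all-positive quantities."""
    return [math.log(x) for x in draws]


def logit_transform(draws):
    """For quantities constrained to fall in (0, 1)."""
    return [math.log(x / (1.0 - x)) for x in draws]


def rank_transform(chains):
    """Replace each draw by its rank (1 = smallest) among all pooled draws."""
    flat = sorted((x, j, i) for j, c in enumerate(chains) for i, x in enumerate(c))
    ranks = [[0.0] * len(c) for c in chains]
    pos = 0
    while pos < len(flat):
        end = pos
        # Extend the block over tied values so they share one average rank.
        while end + 1 < len(flat) and flat[end + 1][0] == flat[pos][0]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            _, j, i = flat[k]
            ranks[j][i] = average_rank
        pos = end + 1
    return ranks


ranked = rank_transform([[10.0, 1000.0], [0.1, 10.0]])
```

The transformed values keep the chain-by-iteration layout of the input, so they can be passed to the same between- and within-sequence calculations as the original draws.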


Stopping the simulations 


We monitor convergence for the entire multivariate distribution, $p(\theta|y)$, by computing the
potential scale reduction factor (11.4) and the effective sample size (11.8) for each scalar
summary of interest. (Recall that we are using $\theta$ to denote the vector of unknowns in the
posterior distribution, and $\psi$ to represent scalar summaries, considered one at a time.)

We recommend computing the potential scale reduction for all scalar estimands of interest;
if $\hat{R}$ is not near 1 for all of them, continue the simulation runs (perhaps altering
the simulation algorithm itself to make the simulations more efficient, as described in the
next section). The condition of $\hat{R}$ being ‘near’ 1 depends on the problem at hand, but we
generally have been satisfied with setting 1.1 as a threshold.

We can use effective sample size $\hat{n}_{\mathrm{eff}}$ to give us a sense of the precision obtained from our
simulations. As we have discussed in Section 10.5, for many purposes it should suffice to have
100 or even 10 independent simulation draws. (If $n_{\mathrm{eff}} = 10$, the simulation standard error
is increased by $\sqrt{1 + 1/10} = 1.05$.) As a default rule, we suggest running the simulation
until $\hat{n}_{\mathrm{eff}}$ is at least $5m$, that is, until there are the equivalent of at least 10 independent
draws per sequence (recall that $m$ is twice the number of sequences, as we have split each
sequence into two parts so that $\hat{R}$ can assess stationarity as well as mixing). Having an
effective sample size of 10 per sequence should typically correspond to stability of all the 
simulated sequences. For some purposes, more precision will be desired, and then a higher 
effective sample size threshold can be used. 
Diet    Measurements 
A       62, 60, 63, 59 
B       63, 67, 71, 64, 65, 66 
C       68, 66, 71, 67, 68, 68 
D       56, 62, 60, 61, 63, 64, 63, 59 


Table 11.2 Coagulation time in seconds for blood drawn from 24 animals randomly allocated to four 
different diets. Different treatments have different numbers of observations because the 
randomization was unrestricted. From Box, Hunter, and Hunter (1978), who adjusted the data so that 
the averages are integers, a complication we ignore in our analysis. 


Once $\hat{R}$ is near 1 and $\hat{n}_{\mathrm{eff}}$ is more than 10 per chain for all scalar estimands of interest, 
just collect the $mn$ simulations (with warm-up iterations already excluded, as noted before) 
and treat them as a sample from the target distribution. 

Even if an iterative simulation appears to converge and has passed all tests of convergence, 
it still may actually be far from convergence if important areas of the target 
distribution were not captured by the starting distribution and are not easily reachable by 
the simulation algorithm. When we declare approximate convergence, we are actually concluding 
that each individual sequence appears stationary and that the observed sequences 
have mixed well with each other. These checks are not hypothesis tests. There is no $p$-value 
and no statistical significance. We assess discrepancy from convergence via practical 
significance (or some conventional version thereof, such as $\hat{R} > 1.1$). 
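
As a rough illustration of the monitoring rule, the following Python sketch computes the potential scale reduction for one scalar summary from split sequences. It assumes the usual between- and within-sequence variance definitions (which should be checked against (11.3) and (11.4)); the function name `split_rhat` and the simulated inputs are hypothetical and serve only to show the behavior.

```python
import numpy as np


def split_rhat(chains):
    """Potential scale reduction for one scalar summary.

    chains: array of shape (n_chains, n_iter), warm-up already discarded.
    Each chain is split in half, giving m = 2 * n_chains sequences of length n.
    """
    chains = np.asarray(chains, dtype=float)
    n = chains.shape[1] // 2                       # length of each half
    halves = np.concatenate([chains[:, :n], chains[:, n:2 * n]], axis=0)
    seq_means = halves.mean(axis=1)
    B = n * seq_means.var(ddof=1)                  # between-sequence variance
    W = halves.var(axis=1, ddof=1).mean()          # within-sequence variance
    var_plus = (n - 1) / n * W + B / n             # pooled variance estimate
    return np.sqrt(var_plus / W)


rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))                        # four well-mixed sequences
stuck = mixed + np.array([0.0, 0.0, 3.0, 3.0])[:, None]   # two sequences offset by 3
rhat_mixed, rhat_stuck = split_rhat(mixed), split_rhat(stuck)
```

With these artificial inputs, the well-mixed sequences should fall below the 1.1 threshold and the offset sequences well above it.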


11.6 Example: hierarchical normal model 


We illustrate the simulation algorithms with a hierarchical normal model, extending the 
problem discussed in Section 5.4 by allowing an unknown data variance, $\sigma^2$. The example 
is continued in Section 13.6 to illustrate mode-based computation. We demonstrate with 
the normal model because it is simple enough that the key computational ideas do not get 
lost in the details. 


Data from a small experiment 


We demonstrate the computations on a small experimental dataset, displayed in Table 11.2, 
that has been used previously as an example in the statistical literature. Our purpose here 
is solely to illustrate computational methods, not to perform a full Bayesian data analysis 
(which includes model construction and model checking), and so we do not discuss the 
applied context. 


The model 
Under the hierarchical normal model (restated here, for convenience), data $y_{ij}$, $i = 1,\ldots,n_j$, 
$j = 1,\ldots,J$, are independently normally distributed within each of $J$ groups, with means 
$\theta_j$ and common variance $\sigma^2$. The total number of observations is $n = \sum_{j=1}^{J} n_j$. The group 
means are assumed to follow a normal distribution with unknown mean $\mu$ and variance 
$\tau^2$, and a uniform prior distribution is assumed for $(\mu, \log\sigma, \tau)$, with $\sigma > 0$ and $\tau > 0$; 
equivalently, $p(\mu, \log\sigma, \log\tau) \propto \tau$. If we were to assign a uniform prior distribution to $\log\tau$, 
the posterior distribution would be improper, as discussed in Chapter 5. 


The joint posterior density of all the parameters is 


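$$p(\theta, \mu, \log\sigma, \log\tau \mid y) \propto \tau \prod_{j=1}^{J} \mathrm{N}(\theta_j \mid \mu, \tau^2) \prod_{j=1}^{J} \prod_{i=1}^{n_j} \mathrm{N}(y_{ij} \mid \theta_j, \sigma^2).$$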
Starting points 


In this example, we can choose overdispersed starting points for each parameter $\theta_j$ by simply 
taking random points from the data $y_{ij}$ from group $j$. We obtain 10 starting points for the 
simulations by drawing $\theta_j$ independently in this way for each group. We also need starting 
points for $\mu$, which can be taken as the average of the starting $\theta_j$ values. No starting values 
are needed for $\tau$ or $\sigma$ as they can be drawn as the first steps in the Gibbs sampler. 

Section 13.6 presents a more elaborate procedure for constructing a starting distribution 
for the iterative simulations using the posterior mode and a normal approximation. 


Gibbs sampler 


The conditional distributions for this model all have simple conjugate forms: 

1. Conditional posterior distribution of each $\theta_j$. The factors in the joint posterior density 
that involve $\theta_j$ are the $\mathrm{N}(\mu, \tau^2)$ prior distribution and the normal likelihood from the 
data in the $j$th group, $y_{ij}$, $i = 1,\ldots,n_j$. The conditional posterior distribution of each 
$\theta_j$ given the other parameters in the model is 

$$\theta_j \mid \mu, \sigma, \tau, y \sim \mathrm{N}(\hat\theta_j, V_{\theta_j}), \tag{11.9}$$

where the parameters of the conditional posterior distribution depend on $\mu$, $\sigma$, and $\tau$ as 
well as $y$: 

$$\hat\theta_j = \frac{\frac{1}{\tau^2}\mu + \frac{n_j}{\sigma^2}\bar{y}_{.j}}{\frac{1}{\tau^2} + \frac{n_j}{\sigma^2}}, \tag{11.10}$$

$$V_{\theta_j} = \frac{1}{\frac{1}{\tau^2} + \frac{n_j}{\sigma^2}}. \tag{11.11}$$

These conditional distributions are independent; thus drawing the $\theta_j$’s one at a time is 
equivalent to drawing the vector $\theta$ all at once from its conditional posterior distribution. 

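2. Conditional posterior distribution of $\mu$. Conditional on $y$ and the other parameters in 
the model, $\mu$ has a normal distribution determined by the $\theta_j$’s: 

$$\mu \mid \theta, \sigma, \tau, y \sim \mathrm{N}(\hat\mu, \tau^2/J), \tag{11.12}$$

where 

$$\hat\mu = \frac{1}{J} \sum_{j=1}^{J} \theta_j. \tag{11.13}$$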


3. Conditional posterior distribution of $\sigma^2$. The conditional posterior density for $\sigma^2$ has the 
form corresponding to a normal variance with known mean; there are $n$ observations $y_{ij}$ 
with means $\theta_j$. The conditional posterior distribution is 

$$\sigma^2 \mid \theta, \mu, \tau, y \sim \text{Inv-}\chi^2(n, \hat\sigma^2), \tag{11.14}$$

where 

$$\hat\sigma^2 = \frac{1}{n} \sum_{j=1}^{J} \sum_{i=1}^{n_j} (y_{ij} - \theta_j)^2. \tag{11.15}$$


4. Conditional posterior distribution of $\tau^2$. Conditional on the data and the other parameters 
in the model, $\tau^2$ has a scaled inverse-$\chi^2$ distribution, with parameters depending 
only on $\mu$ and $\theta$ (as can be seen by examining the joint posterior density): 

$$\tau^2 \mid \theta, \mu, \sigma, y \sim \text{Inv-}\chi^2(J - 1, \hat\tau^2), \tag{11.16}$$

with 

$$\hat\tau^2 = \frac{1}{J - 1} \sum_{j=1}^{J} (\theta_j - \mu)^2. \tag{11.17}$$

The expressions for $\tau^2$ have $(J - 1)$ degrees of freedom instead of $J$ because $p(\tau) \propto 1$ 
rather than $\tau^{-1}$. 


Estimand                                          Posterior quantiles                  $\hat{R}$ 
                                          2.5%     25%   median     75%   97.5% 
$\theta_1$                                58.9    60.6     61.3    62.1    63.5    1.01 
$\theta_2$                                63.9    65.3     65.9    66.6    67.7    1.01 
$\theta_3$                                66.0    67.1     67.8    68.5    69.5    1.01 
$\theta_4$                                59.5    60.6     61.1    61.7    62.8    1.01 
$\mu$                                     56.9    62.2     63.9    65.5    73.4    1.04 
$\sigma$                                   1.8     2.2      2.4     2.6     3.3    1.00 
$\tau$                                     2.1     3.6      4.9     7.6    26.6    1.05 
$\log p(\mu, \log\sigma, \log\tau|y)$        -67.6   -64.3    -63.4   -62.6   -62.0    1.02 
$\log p(\theta, \mu, \log\sigma, \log\tau|y)$  -70.6   -66.5    -65.1   -64.0   -62.4    1.01 


Table 11.3 Summary of inference for the coagulation example. Posterior quantiles and estimated 
potential scale reductions are computed from the second halves of ten Gibbs sampler sequences, 
each of length 100. Potential scale reductions for $\sigma$ and $\tau$ are computed on the log scale. The 
hierarchical standard deviation, $\tau$, is estimated less precisely than the unit-level standard deviation, 
$\sigma$, as is typical in hierarchical modeling with a small number of batches. 


Numerical results with the coagulation data 


We illustrate the Gibbs sampler with the coagulation data of Table 11.2. Inference from 
ten parallel Gibbs sampler sequences appears in Table 11.3; 100 iterations were sufficient 
for approximate convergence. 
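
The four conditional draws (11.9) to (11.17) and the starting-point rule can be sketched in Python as follows. This is a minimal illustration rather than the code used to produce Table 11.3: the helper names `inv_chi2` and `gibbs_chain`, the random seed, and the sequence length of 1000 are our own assumptions, and the scaled inverse-$\chi^2$ draw uses the standard representation $\nu s^2 / \chi^2_\nu$.

```python
import numpy as np

# Coagulation data from Table 11.2, one array per diet (A, B, C, D).
y = [np.array([62, 60, 63, 59], dtype=float),
     np.array([63, 67, 71, 64, 65, 66], dtype=float),
     np.array([68, 66, 71, 67, 68, 68], dtype=float),
     np.array([56, 62, 60, 61, 63, 64, 63, 59], dtype=float)]
J = len(y)
n_j = np.array([len(g) for g in y])
n = n_j.sum()
ybar = np.array([g.mean() for g in y])


def inv_chi2(rng, df, scale2):
    """One draw from the scaled inverse-chi^2(df, scale2) distribution."""
    return df * scale2 / rng.chisquare(df)


def gibbs_chain(n_iter, rng):
    """Run one Gibbs sequence; columns are theta_1..theta_J, mu, sigma, tau."""
    theta = np.array([rng.choice(g) for g in y])   # random data point per group
    mu = theta.mean()                              # average of starting thetas
    draws = np.empty((n_iter, J + 3))
    for t in range(n_iter):
        # tau^2 | theta, mu: (11.16)-(11.17)
        tau2 = inv_chi2(rng, J - 1, np.sum((theta - mu) ** 2) / (J - 1))
        # sigma^2 | theta, y: (11.14)-(11.15)
        ss = sum(np.sum((g - th) ** 2) for g, th in zip(y, theta))
        sigma2 = inv_chi2(rng, n, ss / n)
        # theta_j | mu, sigma, tau, y: (11.9)-(11.11)
        prec = 1 / tau2 + n_j / sigma2
        theta = rng.normal((mu / tau2 + n_j * ybar / sigma2) / prec,
                           np.sqrt(1 / prec))
        # mu | theta, tau: (11.12)-(11.13)
        mu = rng.normal(theta.mean(), np.sqrt(tau2 / J))
        draws[t] = np.concatenate([theta, [mu, np.sqrt(sigma2), np.sqrt(tau2)]])
    return draws


rng = np.random.default_rng(1)
# Ten sequences of length 1000 (a choice for this sketch); keep the second halves.
sims = np.concatenate([gibbs_chain(1000, rng)[500:] for _ in range(10)])
medians = np.median(sims, axis=0)
```

Drawing $\tau^2$ and $\sigma^2$ first means that only $\theta$ and $\mu$ need starting values, as described under "Starting points." The resulting medians should be broadly comparable to the median column of Table 11.3, up to simulation variability.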


The Metropolis algorithm 


We also describe how the Metropolis algorithm can be used for this problem. It would be 
possible to apply the algorithm to the entire joint distribution, $p(\theta, \mu, \sigma, \tau|y)$, but we can 
work more efficiently in a lower-dimensional space by taking advantage of the conjugacy 
of the problem that allows us to compute the function $p(\mu, \log\sigma, \log\tau|y)$, as we discuss in 
Section 13.6. We use the Metropolis algorithm to jump through the marginal posterior 
distribution of $(\mu, \log\sigma, \log\tau)$ and then draw simulations of the vector $\theta$ from its normal 
conditional posterior distribution (11.9). Following a principle of efficient Metropolis jumping 
that we shall discuss in Section 12.2, we jump through the space of $(\mu, \log\sigma, \log\tau)$ using 
a multivariate normal jumping kernel centered at the current value of the parameters and 
variance matrix equal to that of a normal approximation (see Section 13.6), multiplied by 
$2.4^2/d$, where $d$ is the dimension of the Metropolis jumping distribution. In this case, $d = 3$. 


Metropolis results with the coagulation data 


We ran ten parallel sequences of Metropolis algorithm simulations. In this case 500 iterations 
were sufficient for approximate convergence ($\hat{R} < 1.1$ for all parameters); at that point we 
obtained similar results to those obtained using Gibbs sampling. The acceptance rate for 
the Metropolis simulations was 0.35, which is close to the expected result for the normal 
distribution with $d = 3$ using a jumping distribution scaled by $2.4/\sqrt{d}$ (see Section 12.2). 
11.7 Bibliographic note 


Gilks, Richardson, and Spiegelhalter (1996) is a book full of examples and applications of 
Markov chain simulation methods. Further references on Bayesian computation appear in 
the books by Tanner (1993), Chen, Shao, and Ibrahim (2000), and Robert and Casella 
(2004). Many other applications of Markov chain simulation appear in the recent applied 
statistical literature. 

Metropolis and Ulam (1949) and Metropolis et al. (1953) apparently were the first 
to describe Markov chain simulation of probability distributions (that is, the ‘Metropolis 
algorithm’). Their algorithm was generalized by Hastings (1970); see Chib and Greenberg 
(1995) for an elementary introduction and Tierney (1998) for a theoretical perspective. The 
conditions for Markov chain convergence appear in probability texts such as Feller (1968), 
and more recent work such as Rosenthal (1995) has evaluated the rates of convergence of 
Markov chain algorithms for statistical models. The Gibbs sampler was first so named 
by Geman and Geman (1984) in a discussion of applications to image processing. Tanner 
and Wong (1987) introduced the idea of iterative simulation to many statisticians, using 
the special case of ‘data augmentation’ to emphasize the analogy to the EM algorithm 
(see Section 13.4). Gelfand and Smith (1990) showed how the Gibbs sampler could be 
used for Bayesian inference for a variety of important statistical models. The Metropolis- 
approximate Gibbs algorithm introduced at the end of Section 11.3 appears in Gelman 
(1992b) and is used by Gilks, Best, and Tan (1995). 

Gelfand et al. (1990) applied Gibbs sampling to a variety of statistical problems, and 
many other applications of Gibbs sampler algorithms have appeared since; for example, 
Clayton (1991) and Carlin and Polson (1991). Besag and Green (1993), Gilks et al. (1993), 
and Smith and Roberts (1993) discuss Markov simulation algorithms for Bayesian compu- 
tation. Bugs (Spiegelhalter et al., 1994, 2003) is a general-purpose computer program for 
Bayesian inference using the Gibbs sampler; see Appendix C for details. 

Inference and monitoring convergence from iterative simulation are reviewed by Gelman 
and Rubin (1992b) and Brooks and Gelman (1998), who provide a theoretical justification of 
the method presented in Section 11.4 and discuss more elaborate versions of the method; see 
also Brooks and Giudici (2000) and Gelman and Shirley (2011). Other views on assessing 
convergence appear in the ensuing discussion of Gelman and Rubin (1992b) and Geyer 
(1992) and in Cowles and Carlin (1996) and Brooks and Roberts (1998). Gelman and 
Rubin (1992a,b) and Glickman (1993) present examples of iterative simulation in which lack 
of convergence is impossible to detect from single sequences but is obvious from multiple 
sequences. The rule for summing autocorrelations and stopping after the sum of two is 
negative comes from Geyer (1992). 

Venna, Kaski, and Peltonen (2003) and Peltonen, Venna, and Kaski (2009) discuss 
graphical diagnostics for convergence of iterative simulations. 


11.8 Exercises 


1. Metropolis-Hastings algorithm: Show that the stationary distribution for the 
Metropolis-Hastings algorithm is, in fact, the target distribution, $p(\theta|y)$. 


2. Metropolis algorithm: Replicate the computations for the bioassay example of Section 3.7 
using the Metropolis algorithm. Be sure to define your starting points and your jumping 
rule. Compute with log-densities (see page 261). Run the simulations long enough for 
approximate convergence. 


3. Gibbs sampling: Table 11.4 contains quality control measurements from 6 machines in 
a factory. Quality control measurements are expensive and time-consuming, so only 5 
measurements were done for each machine. In addition to the existing machines, we 
are interested in the quality of another machine (the seventh machine). Implement a 
separate, a pooled, and a hierarchical Gaussian model with common variance, as described in 
Section 11.6. Run the simulations long enough for approximate convergence. Using each 
of the three models (separate, pooled, and hierarchical), report: (i) the posterior distribution 
of the mean of the quality measurements of the sixth machine, (ii) the predictive 
distribution for another quality measurement of the sixth machine, and (iii) the posterior 
distribution of the mean of the quality measurements of the seventh machine. 


Machine    Measurements 
1          83, 92, 92, 46, 67 
2          117, 109, 114, 104, 87 
3          101, 93, 92, 86, 67 
4          105, 119, 116, 102, 116 
5          79, 97, 103, 79, 92 
6          57, 92, 104, 77, 100 


Table 11.4: Quality control measurements from 6 machines in a factory. 


4. Gibbs sampling: Extend the model in Exercise 11.3 by adding a hierarchical model for 
the variances of the machine quality measurements. Use an Inv-$\chi^2$ prior distribution 
for variances with unknown scale $\sigma_0^2$ and fixed degrees of freedom. (The data do not 
contain enough information for determining the degrees of freedom, so inference for 
that hyperparameter would depend very strongly on its prior distribution in any case.) 
The conditional distribution of $\sigma_0^2$ is not of simple form, but you can sample from its 
distribution, for example, using grid sampling. 

5. Monitoring convergence: 


(a) Prove that $\widehat{\mathrm{var}}^{+}(\psi|y)$ as defined in (11.3) is an unbiased estimate of the marginal 
posterior variance of $\psi$, if the starting distribution for the Markov chain simulation 
algorithm is the same as the target distribution, and if the $m$ parallel sequences are 
computed independently. (Hint: show that $\widehat{\mathrm{var}}^{+}(\psi|y)$ can be expressed as the average 
of the halved squared differences between simulations $\psi$ from different sequences, and 
that each of these has expectation equal to the posterior variance.) 

(b) Determine the conditions under which $\widehat{\mathrm{var}}^{+}(\psi|y)$ approaches the marginal posterior 
variance of $\psi$ in the limit as the lengths $n$ of the simulated chains approach $\infty$. 

6. Effective sample size: 

(a) Derive the asymptotic formula (11.5) for the variance of the average of correlated 
simulations. 

(b) Implement a Markov chain simulation for some example and plot $\hat{n}_{\mathrm{eff}}$ from (11.8) over 
time. Is $\hat{n}_{\mathrm{eff}}$ stable? Does it gradually increase as a function of number of iterations, 
as one would hope? 

7. Analysis of survey data: Section 8.3 presents an analysis of a stratified sample survey 
using a hierarchical model on the stratum probabilities. 

(a) Perform the computations for the simple nonhierarchical model described in the ex- 
ample. 

(b) Using the Metropolis algorithm, perform the computations for the hierarchical model, 
using the results from part (a) as a starting distribution. Check by comparing your 
simulations to the results in Figure 8.1b. 
Chapter 12 


Computationally efficient Markov chain 
simulation 


The basic Gibbs sampler and Metropolis algorithm can be seen as building blocks for more 
advanced Markov chain simulation algorithms that can work well for a wide range of prob- 
lems. In Sections 12.1 and 12.2, we discuss reparameterizations and settings of tuning 
parameters to make Gibbs and Metropolis more efficient. Section 12.4 presents Hamilto- 
nian Monte Carlo, a generalization of the Metropolis algorithm that includes ‘momentum’ 
variables so that each iteration can move farther in parameter space, thus allowing faster 
mixing, especially in high dimensions. We follow up in Sections 12.5 and 12.6 with an ap- 
plication to a hierarchical model and a discussion of our program Stan, which implements 
HMC for general models. 


12.1 Efficient Gibbs samplers 
Transformations and reparameterization 


The Gibbs sampler is most efficient when parameterized in terms of independent compo- 
nents; Figure 11.2 shows an example with highly dependent components that create slow 
convergence. The simplest way to reparameterize is by a linear transformation of the pa- 
rameters, but posterior distributions that are not approximately normal may require special 
methods. 

The same arguments apply to Metropolis jumps. In a normal or approximately normal 
setting, the jumping kernel should ideally have the same covariance structure as the target 
distribution, which can be approximately estimated based on the normal approximation at 
the mode (as we discuss in Chapter 13). Markov chain simulation of a distribution with 
multiple modes can be greatly improved by allowing jumps between modes. 


Auxiliary variables 


Gibbs sampler computations can often be simplified or convergence accelerated by adding 
auxiliary variables, for example indicators for mixture distributions, as described in Chapter 
22. The idea of adding variables is also called data augmentation and is often a useful 
conceptual and computational tool, both for the Gibbs sampler and for the EM algorithm 
(see Section 13.4). 


Example. Modeling the t distribution as a mixture of normals 

A simple but important example of auxiliary variables arises with the $t$ distribution, 
which can be expressed as a mixture of normal distributions, as noted in Chapter 
3 and discussed in more detail in Chapter 17. We illustrate with the example of 
inference for the parameters $\mu, \sigma^2$ given $n$ independent data points from the $t_\nu(\mu, \sigma^2)$ 
distribution, where for simplicity we assume $\nu$ is known. We also assume a uniform 
prior distribution on $\mu, \log\sigma$. The $t$ likelihood for each data point is equivalent to the 
model, 

$$\begin{aligned} y_i &\sim \mathrm{N}(\mu, V_i) \\ V_i &\sim \text{Inv-}\chi^2(\nu, \sigma^2), \end{aligned} \tag{12.1}$$

where the $V_i$’s are auxiliary variables that cannot be directly observed. If we perform 
inference using the joint posterior distribution, $p(\mu, \sigma^2, V|y)$, and then just consider the 
simulations for $\mu, \sigma$, these will represent the posterior distribution under the original 
$t$ model. 

There is no direct way to sample the parameters $\mu, \sigma^2$ in the $t$ model, but it is 
straightforward to perform the Gibbs sampler on $V, \mu, \sigma^2$ in the augmented model: 


1. Conditional posterior distribution of each $V_i$. Conditional on the data $y$ and the 
other parameters of the model, each $V_i$ is a normal variance parameter with a scaled 
inverse-$\chi^2$ prior distribution, and so its posterior distribution is also inverse-$\chi^2$ (see 
Section 2.6): 

$$V_i \mid \mu, \sigma^2, \nu, y \sim \text{Inv-}\chi^2\left(\nu + 1, \frac{\nu\sigma^2 + (y_i - \mu)^2}{\nu + 1}\right).$$

The $n$ parameters $V_i$ are independent in their conditional posterior distribution, and 
we can directly apply the Gibbs sampler by sampling from their scaled inverse-$\chi^2$ 
distributions. 

2. Conditional posterior distribution of $\mu$. Conditional on the data $y$ and the other 
parameters of the model, information about $\mu$ is supplied by the $n$ data points $y_i$, 
each with its own variance. Combining with the uniform prior distribution on $\mu$ 
yields, 

$$\mu \mid \sigma^2, V, \nu, y \sim \mathrm{N}\left(\frac{\sum_{i=1}^{n} \frac{1}{V_i} y_i}{\sum_{i=1}^{n} \frac{1}{V_i}}, \frac{1}{\sum_{i=1}^{n} \frac{1}{V_i}}\right).$$

3. Conditional posterior distribution of $\sigma^2$. Conditional on the data $y$ and the other 
parameters of the model, all the information about $\sigma$ comes from the variances $V_i$. 
The conditional posterior distribution is, 

$$p(\sigma^2 \mid \mu, V, \nu, y) \propto \sigma^{-2} \prod_{i=1}^{n} \sigma^{\nu} e^{-\nu\sigma^2/(2V_i)} = (\sigma^2)^{n\nu/2 - 1} \exp\left(-\frac{\nu}{2} \sum_{i=1}^{n} \frac{1}{V_i}\, \sigma^2\right),$$

which is a Gamma density in $\sigma^2$ with shape $n\nu/2$ and rate $\frac{\nu}{2}\sum_{i=1}^{n} 1/V_i$, 
from which we can sample directly. 
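
The three conditional draws can be sketched in Python as follows. This is an illustration under stated assumptions: the function name `t_gibbs`, the starting values, and the simulated data are hypothetical, and step 3 relies on the observation that the product in the conditional density of $\sigma^2$ is proportional to a Gamma density with shape $n\nu/2$ and rate $(\nu/2)\sum_i 1/V_i$.

```python
import numpy as np


def t_gibbs(y, nu, n_iter, rng):
    """Gibbs sampler for (mu, sigma^2) under the t_nu model via model (12.1)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    mu, sigma2 = np.median(y), np.var(y)           # crude starting values
    out = np.empty((n_iter, 2))
    for t in range(n_iter):
        # Step 1: V_i ~ Inv-chi^2(nu + 1, (nu*sigma^2 + (y_i - mu)^2) / (nu + 1)),
        # drawn as (nu*sigma^2 + (y_i - mu)^2) / chi^2_{nu+1}.
        V = (nu * sigma2 + (y - mu) ** 2) / rng.chisquare(nu + 1, size=n)
        # Step 2: mu ~ N(precision-weighted mean, 1 / total precision).
        w = 1.0 / V
        mu = rng.normal(np.sum(w * y) / np.sum(w), np.sqrt(1.0 / np.sum(w)))
        # Step 3: sigma^2 ~ Gamma(shape n*nu/2, rate (nu/2) * sum(1/V_i)).
        # NumPy's gamma takes a scale parameter, so we pass 1/rate.
        sigma2 = rng.gamma(n * nu / 2.0, 2.0 / (nu * np.sum(w)))
        out[t] = mu, sigma2
    return out


rng = np.random.default_rng(0)
y_sim = 5.0 + 2.0 * rng.standard_t(4, size=200)    # simulated data: mu=5, sigma=2
draws = t_gibbs(y_sim, nu=4, n_iter=2000, rng=rng)[1000:]
mu_mean = draws[:, 0].mean()
sigma_mean = np.sqrt(draws[:, 1]).mean()
```

With the simulated data, the posterior means of $\mu$ and $\sigma$ should land near the values used to generate it.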


Parameter expansion 


For some problems, the Gibbs sampler can be slow to converge because of posterior de- 
pendence among parameters that cannot simply be resolved with a linear transformation. 
Paradoxically, adding an additional parameter—thus performing the random walk in a 
larger space—can improve the convergence of the Markov chain simulation. We illustrate 
with the t example above. 
Example. Fitting the t model (continued) 

In the latent-parameter form (12.1) of the $t$ model, convergence will be slow if a 
simulation draw of $\sigma$ is close to zero, because the conditional distributions will then cause 
the $V_i$’s to be sampled with values near zero, and then the conditional distribution of 
$\sigma$ will be near zero, and so on. Eventually the simulations will get unstuck but it can 
be slow for some problems. We can fix things by adding a new parameter whose only 
role is to allow the Gibbs sampler to move in more directions and thus avoid getting 
stuck. The expanded model is, 

$$\begin{aligned} y_i &\sim \mathrm{N}(\mu, \alpha^2 U_i) \\ U_i &\sim \text{Inv-}\chi^2(\nu, \tau^2), \end{aligned}$$

where $\alpha > 0$ can be viewed as an additional scale parameter. In this new model, $\alpha^2 U_i$ 
plays the role of $V_i$ in (12.1) and $\alpha\tau$ plays the role of $\sigma$. The parameter $\alpha$ has no 
meaning on its own and we can assign it a noninformative uniform prior distribution 
on the logarithmic scale. 
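The Gibbs sampler on this expanded model now has four steps: 


1. For each $i$, $U_i$ is updated much as $V_i$ was before: 

$$U_i \mid \alpha, \mu, \tau^2, \nu, y \sim \text{Inv-}\chi^2\left(\nu + 1, \frac{\nu\tau^2 + ((y_i - \mu)/\alpha)^2}{\nu + 1}\right).$$

2. The mean, $\mu$, is updated as before: 

$$\mu \mid \alpha, \tau^2, U, \nu, y \sim \mathrm{N}\left(\frac{\sum_{i=1}^{n} \frac{1}{\alpha^2 U_i} y_i}{\sum_{i=1}^{n} \frac{1}{\alpha^2 U_i}}, \frac{1}{\sum_{i=1}^{n} \frac{1}{\alpha^2 U_i}}\right).$$

3. The variance parameter $\tau^2$ is updated much as $\sigma^2$ was before: 

$$\tau^2 \mid \alpha, \mu, U, \nu, y \sim \text{Gamma}\left(\frac{n\nu}{2}, \frac{\nu}{2} \sum_{i=1}^{n} \frac{1}{U_i}\right).$$

4. Finally, we must update $\alpha^2$, which is easy since conditional on all the other parameters 
in the model it is simply a normal variance parameter: 

$$\alpha^2 \mid \mu, \tau^2, U, \nu, y \sim \text{Inv-}\chi^2\left(n, \frac{1}{n} \sum_{i=1}^{n} \frac{(y_i - \mu)^2}{U_i}\right).$$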


The parameters $\alpha^2, U, \tau$ in this expanded model are not identified in that the data 
do not supply enough information to estimate each of them. However, the model as 
a whole is identified as long as we monitor convergence of the summaries $\mu$, $\sigma = \alpha\tau$, 
and $V_i = \alpha^2 U_i$ for $i = 1,\ldots,n$. (Or, if the only goal is inference for the original $t$ 
model, we can simply save $\mu$ and $\sigma$ from the simulations.) 

The Gibbs sampler under the expanded parameterizations converges more reliably 
because the new parameter $\alpha$ breaks the dependence between $\tau$ and the $V_i$’s. 
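
A sketch of the four-step sampler for the expanded model, again in Python, may help make the bookkeeping concrete. The function name `t_gibbs_expanded`, the starting values, and the simulated data are hypothetical; only the identified summaries $\mu$ and $\sigma = \alpha\tau$ are saved, since $\alpha$ and $\tau$ are not separately identified and may drift.

```python
import numpy as np


def t_gibbs_expanded(y, nu, n_iter, rng):
    """Parameter-expanded Gibbs sampler; returns draws of (mu, sigma = alpha*tau)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    mu, alpha2, tau2 = np.median(y), 1.0, np.var(y)   # crude starting values
    out = np.empty((n_iter, 2))
    for t in range(n_iter):
        # Step 1: U_i ~ Inv-chi^2(nu+1, (nu*tau^2 + ((y_i-mu)/alpha)^2)/(nu+1)).
        U = (nu * tau2 + (y - mu) ** 2 / alpha2) / rng.chisquare(nu + 1, size=n)
        # Step 2: mu ~ N(precision-weighted mean, 1 / total precision).
        w = 1.0 / (alpha2 * U)
        mu = rng.normal(np.sum(w * y) / np.sum(w), np.sqrt(1.0 / np.sum(w)))
        # Step 3: tau^2 ~ Gamma(shape n*nu/2, rate (nu/2) * sum(1/U_i)).
        tau2 = rng.gamma(n * nu / 2.0, 2.0 / (nu * np.sum(1.0 / U)))
        # Step 4: alpha^2 ~ Inv-chi^2(n, (1/n) * sum((y_i-mu)^2 / U_i)).
        alpha2 = np.sum((y - mu) ** 2 / U) / rng.chisquare(n)
        out[t] = mu, np.sqrt(alpha2 * tau2)
    return out


rng = np.random.default_rng(0)
y_sim = 5.0 + 2.0 * rng.standard_t(4, size=200)    # simulated data: mu=5, sigma=2
draws_px = t_gibbs_expanded(y_sim, nu=4, n_iter=2000, rng=rng)[1000:]
mu_px, sigma_px = draws_px.mean(axis=0)
```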


We discuss parameter expansion for hierarchical models in Section 15.5 and illustrate in 
Appendix C. 


12.2 Efficient Metropolis jumping rules 


For any given posterior distribution, the Metropolis-Hastings algorithm can be implemented 
in an infinite number of ways. Even after reparameterizing, there are still endless choices in 
the jumping rules, $J_t$. In many situations with conjugate families, the posterior simulation 
can be performed entirely or in part using the Gibbs sampler, which is not always efficient 
but generally is easy to program, as we illustrated with the hierarchical normal model in 
Section 11.6. For nonconjugate models we must rely on Metropolis-Hastings algorithms 
(either within a Gibbs sampler or directly on the multivariate posterior distribution). The 
choice of jumping rule then arises. 

There are two main classes of simple jumping rules. The first are essentially random 
walks around the parameter space. These jumping rules are often normal jumping kernels 
with mean equal to the current value of the parameter and variance set to obtain efficient 
algorithms. The second approach uses proposal distributions that are constructed to closely 
approximate the target distribution (either the conditional distribution of a subset in a 
Gibbs sampler or the joint posterior distribution). In the second case the goal is to accept as 
many draws as possible with the Metropolis-Hastings acceptance step being used primarily 
to correct the approximation. There is no natural advantage to altering one parameter at 
a time except for potential computational savings in evaluating only part of the posterior 
density at each step. 

It is hard to give general advice on efficient jumping rules, but some results have been 
obtained for random walk jumping distributions that have been useful in many problems. 
Suppose there are $d$ parameters, and the posterior distribution of $\theta = (\theta_1,\ldots,\theta_d)$, after 
appropriate transformation, is multivariate normal with known variance matrix $\Sigma$. Further 
suppose that we will take draws using the Metropolis algorithm with a normal jumping 
kernel centered on the current point and with the same shape as the target distribution: 
that is, $J(\theta^*|\theta^{t-1}) = \mathrm{N}(\theta^*|\theta^{t-1}, c^2\Sigma)$. Among this class of jumping rules, the most efficient 
has scale $c \approx 2.4/\sqrt{d}$, where efficiency is defined relative to independent sampling from 
the posterior distribution. The efficiency of this optimal Metropolis jumping rule for the 
$d$-dimensional normal distribution can be shown to be about $0.3/d$ (by comparison, if the 
$d$ parameters were independent in their posterior distribution, the Gibbs sampler would 
have efficiency $1/d$, because after every $d$ iterations, a new independent draw of $\theta$ would 
be created). Which algorithm is best for any particular problem also depends on the 
computation time for each iteration, which in turn depends on the conditional independence 
and conjugacy properties of the posterior density. 
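
To make the scaling rule concrete, here is a small Python sketch of random-walk Metropolis on a three-dimensional normal target with jumping covariance $c^2\Sigma$ and $c = 2.4/\sqrt{d}$. The covariance matrix `Sigma`, the function name `rw_metropolis`, and the run length are hypothetical choices for illustration only.

```python
import numpy as np


def rw_metropolis(log_p, theta0, jump_cov, n_iter, rng):
    """Random-walk Metropolis with a normal jumping kernel N(theta, jump_cov)."""
    theta = np.asarray(theta0, dtype=float)
    L = np.linalg.cholesky(jump_cov)               # jump_cov = L @ L.T
    lp = log_p(theta)
    draws = np.empty((n_iter, theta.size))
    accepted = 0
    for t in range(n_iter):
        proposal = theta + L @ rng.standard_normal(theta.size)
        lp_prop = log_p(proposal)
        # Accept with probability min(1, p(proposal)/p(theta)), on the log scale.
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
            accepted += 1
        draws[t] = theta
    return draws, accepted / n_iter


d = 3
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 0.5]])                # hypothetical target covariance
Sigma_inv = np.linalg.inv(Sigma)


def log_p(theta):
    """Unnormalized log density of the N(0, Sigma) target."""
    return -0.5 * theta @ Sigma_inv @ theta


rng = np.random.default_rng(0)
c = 2.4 / np.sqrt(d)                               # scale suggested in the text
draws, acc_rate = rw_metropolis(log_p, np.zeros(d), c**2 * Sigma, 20000, rng)
```

Rerunning with a much smaller or larger `c` is a simple way to see the trade-off: small jumps are accepted often but move slowly, while large jumps are rarely accepted.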

A Metropolis algorithm can also be characterized by the proportion of jumps that are 
accepted. For the multivariate normal random walk jumping distribution with jumping 
kernel the same shape as the target distribution, the optimal jumping rule has acceptance 
rate around 0.44 in one dimension, declining to about 0.23 in high dimensions (roughly 
$d > 5$). This result suggests an adaptive simulation algorithm: 


1. Start the parallel simulations with a fixed algorithm, such as a version of the Gibbs 
sampler, or the Metropolis algorithm with a normal random walk jumping rule shaped 
like an estimate of the target distribution (using the covariance matrix computed at the 
joint or marginal posterior mode scaled by the factor $2.4/\sqrt{d}$). 


2. After some number of simulations, update the Metropolis jumping rule as follows. 


(a) Adjust the covariance of the jumping distribution to be proportional to the posterior 
covariance matrix estimated from the simulations. 


(b) Increase or decrease the scale of the jumping distribution if the acceptance rate of the 
simulations is much too high or low, respectively. The goal is to bring the acceptance 
rate toward the approximately optimal value of 0.44 (in one dimension) or 0.23 (when 
many parameters are being updated at once using vector jumping). 


This algorithm can be improved in various ways, but even in its simple form, we have found 
it useful for drawing posterior simulations for some problems with d ranging from 1 to 50. 
Adaptive algorithms 


When an iterative simulation algorithm is ‘tuned’—that is, modified while it is running— 
care must be taken to avoid converging to the wrong distribution. If the updating rule 
depends on previous simulation steps, then the transition probabilities are more compli- 
cated than as stated in the Metropolis-Hastings algorithm, and the iterations will not in 
general converge to the target distribution. To see the consequences, consider an adap- 
tation that moves the algorithm more quickly through flat areas of the distribution and 
moves more slowly when the posterior density is changing rapidly. This would make sense 
as a way of exploring the target distribution, but the resulting simulations would spend 
disproportionately less time in the flat parts of the distribution and more time in variable 
parts; the resulting simulation draws would not match the target distribution unless some 
sort of correction is applied. 

To be safe, we typically run any adaptive algorithm in two phases: first, the adaptive 
phase, where the parameters of the algorithm can be tuned as often as desired to increase 
the simulation efficiency, and second, a fixed phase, where the adapted algorithm is run 
long enough for approximate convergence. Only simulations from the fixed phase are used 
in the final inferences. 
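
The two-phase scheme can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the block-wise adaptation, the multiplicative scale update `exp(acc - target_acc)`, the small ridge added to the estimated covariance, and the six-dimensional normal target are hypothetical choices, not a prescribed tuning rule.

```python
import numpy as np


def metropolis_block(log_p, theta, cov, n_iter, rng):
    """Fixed-kernel random-walk Metropolis; returns draws and acceptance rate."""
    L = np.linalg.cholesky(cov)
    lp, acc = log_p(theta), 0
    draws = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        prop = theta + L @ rng.standard_normal(theta.size)
        lp_prop = log_p(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp, acc = prop, lp_prop, acc + 1
        draws[t] = theta
    return draws, acc / n_iter


def adaptive_metropolis(log_p, theta0, n_adapt_blocks, block_len, n_fixed, rng,
                        target_acc=0.23):
    theta = np.asarray(theta0, dtype=float)
    d = theta.size
    scale, shape = 2.4 / np.sqrt(d), np.eye(d)
    # Adaptive phase: retune after each block; these draws are discarded.
    for _ in range(n_adapt_blocks):
        draws, acc = metropolis_block(log_p, theta, scale**2 * shape, block_len, rng)
        theta = draws[-1]
        # (a) Shape proportional to the covariance estimated from the simulations.
        shape = np.atleast_2d(np.cov(draws.T)) + 1e-6 * np.eye(d)
        # (b) Raise the scale if acceptance is too high, lower it if too low.
        scale *= np.exp(acc - target_acc)
    # Fixed phase: the kernel no longer changes; only these draws are kept.
    return metropolis_block(log_p, theta, scale**2 * shape, n_fixed, rng)


sd = np.arange(1.0, 7.0)                           # hypothetical target scales


def log_p(theta):
    """Unnormalized log density of independent N(0, sd_k^2) components."""
    return -0.5 * np.sum((theta / sd) ** 2)


rng = np.random.default_rng(0)
draws, acc = adaptive_metropolis(log_p, np.zeros(6), n_adapt_blocks=10,
                                 block_len=500, n_fixed=5000, rng=rng)
```

Because the jumping kernel is frozen before the retained draws begin, the fixed phase is an ordinary Metropolis algorithm, and the concern about adapting to the wrong distribution does not apply to the draws used for inference.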


12.3 Further extensions to Gibbs and Metropolis 
Slice sampling 


A random sample of $\theta$ from the $d$-dimensional target distribution, $p(\theta|y)$, is equivalent to a 
random sample from the area under the distribution (for example, the shaded area under 
the curve in the illustration of rejection sampling in Figure 10.1 on page 264). Formally, 
sampling is performed from the $(d+1)$-dimensional distribution of $(\theta, u)$, where, for any $\theta$, 
$p(\theta, u|y) \propto 1$ for $u \in [0, p(\theta|y)]$ and 0 otherwise. Slice sampling refers to the application of 
iterative simulation algorithms on this uniform distribution. The details of implementing 
an effective slice sampling procedure can be complicated, but the method can be applied 
in great generality and can be especially useful for sampling one-dimensional conditional 
distributions in a Gibbs sampling structure. 
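
For the one-dimensional case, one common way to implement the idea is the stepping-out and shrinkage procedure of Neal (2003), which is not described in the text above and is used here only as an illustration. The function name `slice_sample`, the initial interval width, and the cap on stepping-out moves are hypothetical tuning choices.

```python
import numpy as np


def slice_sample(log_p, x0, n_iter, rng, width=1.0, max_steps=50):
    """Univariate slice sampler with stepping out and shrinkage."""
    x = float(x0)
    draws = np.empty(n_iter)
    for t in range(n_iter):
        # Auxiliary height u ~ Uniform(0, p(x)), handled on the log scale.
        log_u = log_p(x) + np.log(rng.uniform())
        # Stepping out: place an interval of the given width randomly around x,
        # then expand each end until it leaves the slice {x: p(x) > u}.
        left = x - width * rng.uniform()
        right = left + width
        n_left = int(np.floor(max_steps * rng.uniform()))
        n_right = max_steps - 1 - n_left
        while n_left > 0 and log_p(left) > log_u:
            left -= width
            n_left -= 1
        while n_right > 0 and log_p(right) > log_u:
            right += width
            n_right -= 1
        # Shrinkage: draw uniformly from the interval, shrinking it toward x
        # whenever the candidate falls outside the slice.
        while True:
            x_new = rng.uniform(left, right)
            if log_p(x_new) > log_u:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        draws[t] = x
    return draws


rng = np.random.default_rng(0)
# Unnormalized standard normal log density as a simple test target.
draws = slice_sample(lambda x: -0.5 * x * x, x0=0.0, n_iter=5000, rng=rng)
```

Only an unnormalized log density is needed, which is what makes the method convenient for awkward one-dimensional conditionals inside a Gibbs sampler.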


Reversible jump sampling for moving between spaces of differing dimensions 


In a number of settings it is desirable to carry out a trans-dimensional Markov chain simu- 
lation, in which the dimension of the parameter space can change from one iteration to the 
next. One example where this occurs is in model averaging where a single Markov chain 
simulation is constructed that includes moves among a number of plausible models (per- 
haps regression models with different sets of predictors). The ‘parameter space’ for such 
a Markov chain simulation includes the traditional parameters along with an indication of 
the current model. A second example includes finite mixture models (see Chapter 22) in 
which the number of mixture components is allowed to vary. 

It is still possible to perform the Metropolis algorithm in such settings, using the method 
of reversible jump sampling. We use notation corresponding to the case where a Markov 
chain moves among a number of candidate models. Let $M_k$, $k = 1,\ldots,K$, denote the 
candidate models and $\theta_k$ the parameter vector for model $k$ with dimension $d_k$. A key 
aspect of the reversible jump approach is the introduction of additional random variables 
that enable the matching of parameter space dimensions across models. Specifically if a 
move from $k$ to $k^*$ is being considered, then an auxiliary random variable $u$ with jumping 
distribution $J(u|k, k^*, \theta_k)$ is generated. A series of one-to-one deterministic functions are 
defined that do the dimension-matching with $(\theta_{k^*}, u^*) = g_{k,k^*}(\theta_k, u)$ and 
$d_k + \dim(u) = d_{k^*} + \dim(u^*)$. The dimension matching ensures that the balance condition needed to prove 
the convergence of the Metropolis-Hastings algorithm in Chapter 11 continues to hold here. 

We present the reversible jump algorithm in general terms followed by an example. For 
the general description, let $\pi_k$ denote the prior probability on model $k$, $p(\theta_k|M_k)$ the prior 
distribution for the parameters in model $k$, and $p(y|\theta_k, M_k)$ the sampling distribution under 
model $k$. Reversible jump Markov chain simulation generates samples from $p(k, \theta_k|y)$ using 
the following three steps at each iteration: 


1. Starting in state $(k, \theta_k)$ (that is, model $M_k$ with parameter vector $\theta_k$), propose a new 
model $M_{k^*}$ with probability $J_{k,k^*}$ and generate an augmenting random variable $u$ from 
proposal density $J(u|k, k^*, \theta_k)$. 

2. Determine the proposed model’s parameters, $(\theta_{k^*}, u^*) = g_{k,k^*}(\theta_k, u)$. 

3. Define the ratio 

$$r = \frac{p(y|\theta_{k^*}, M_{k^*})\, p(\theta_{k^*}|M_{k^*})\, \pi_{k^*}}{p(y|\theta_k, M_k)\, p(\theta_k|M_k)\, \pi_k} \cdot \frac{J_{k^*,k}\, J(u^*|k^*, k, \theta_{k^*})}{J_{k,k^*}\, J(u|k, k^*, \theta_k)} \cdot \left| \frac{\nabla g_{k,k^*}(\theta_k, u)}{\nabla(\theta_k, u)} \right| \tag{12.2}$$

and accept the new model with probability $\min(r, 1)$. 


The resulting posterior draws provide inference about the posterior probability for each 
model as well as the parameters under that model. 


Example. Testing a variance component in a probit regression 

The application of reversible jump sampling, especially the use of the auxiliary random 
variables $u$, is seen most easily through an example. 

Consider a probit regression for survival of turtles in a natural selection experiment. 
Let $y_{ij}$ denote the binary response for turtle $i$ in family $j$ with $\Pr(y_{ij} = 1) = p_{ij}$ 
for $i = 1,\ldots,n_j$ and $j = 1,\ldots,J$. The weight $x_{ij}$ of the turtle is known to affect 
survival probability, and it is likely that familial factors also play a role. This suggests 
the model, $p_{ij} = \Phi(\alpha_0 + \alpha_1 x_{ij} + b_j)$. It is natural to model the $b_j$’s as exchangeable 
family effects, $b_j \sim \mathrm{N}(0, \tau^2)$. The prior distribution $p(\alpha_0, \alpha_1, \tau)$ is not central to this 
discussion so we do not discuss it further here. 

Suppose for the purpose of this example that we seek to test whether the variance
component $\tau$ is needed by running a Markov chain that considers the model with and
without the varying intercepts, $b_j$. As emphasized in Chapters 6-7, we much prefer
to fit the model with the variance parameter and assess its importance by examining
its posterior distribution. However, it might be of interest to consider the model that
allows $\tau = 0$ as a discrete possibility, and we choose this example to illustrate the
reversible jump algorithm.

Let $M_0$ denote the model with $\tau = 0$ (no variance component) and $M_1$ denote the
model including the variance component. We use numerical integration to compute
the marginal likelihood $p(y|\alpha_0,\alpha_1,\tau)$ for model $M_1$. Thus the $b_j$'s are not part of the
iterative simulation under model $M_1$. The reversible jump algorithm takes $\pi_0 = \pi_1 = 0.5$
and $J_{0,0} = J_{0,1} = J_{1,0} = J_{1,1} = 0.5$. At each step we either take a Metropolis step
within the current model (with probability 0.5) or propose a jump to the other model.
If we are in model 0 and are proposing a jump to model 1, then the auxiliary random
variable is $u \sim J(u)$ (scaled inverse-$\chi^2$ in this case) and we define the parameter vector
for model 1 by setting $\tau^2 = u$ and leaving $\alpha_0$ and $\alpha_1$ as they were in the previous
iteration. The ratio (12.2) from step 3 above is then

$$r = \frac{p(y|\alpha_0,\alpha_1,\tau^2,M_1)\,p(\tau^2)}{p(y|\alpha_0,\alpha_1,M_0)\,J(\tau^2)},$$

because the prior distributions on $\alpha$ and the models cancel, and the Jacobian of
the transformation is 1. The candidate model is accepted with probability min(r, 1).
There is no auxiliary random variable for going from model 1 to model 0. In that case
we merely set $\tau = 0$, and the acceptance ratio is the reciprocal of the above.
In the example we chose $J(\tau^2)$ based on a pilot analysis of model $M_1$ (an inverse-$\chi^2$
distribution matching the posterior mean and variance).
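
The mechanics of the jump between a smaller and a larger model can be sketched on a toy problem that is much simpler than the turtle example. The following Python sketch is an illustration under stated assumptions, not the analysis described above: the data are simulated, $M_0$ is $y_i \sim N(0,1)$, $M_1$ is $y_i \sim N(\mu,1)$ with $\mu \sim N(0,1)$, the auxiliary variable is $u = \mu$ (so the Jacobian is 1), and the proposal $J(u)$ is a normal distribution centered at the conjugate posterior of $\mu$, standing in for a pilot analysis. All function names are hypothetical. Because the toy is conjugate, the exact posterior model probability is available as a check.

```python
import numpy as np

def log_normal_pdf(x, mean, sd):
    """Log density of N(mean, sd^2), evaluated elementwise."""
    return -0.5 * np.log(2.0 * np.pi * sd**2) - 0.5 * ((x - mean) / sd) ** 2

def pilot_posterior(y, prior_sd):
    """Posterior mean and sd of mu under M1 (conjugate normal, unit data variance)."""
    post_var = 1.0 / (len(y) + 1.0 / prior_sd**2)
    return post_var * np.sum(y), np.sqrt(post_var)

def reversible_jump(y, prior_sd=1.0, n_iter=20000, seed=0):
    """Fraction of iterations spent in M1, with pi_0 = pi_1 = 0.5."""
    rng = np.random.default_rng(seed)
    post_mean, post_sd = pilot_posterior(y, prior_sd)
    jump_sd = 1.5 * post_sd                       # J(u): pilot-based, a bit overdispersed
    log_lik = lambda mu: np.sum(log_normal_pdf(y, mu, 1.0))
    log_prior = lambda mu: log_normal_pdf(mu, 0.0, prior_sd)
    log_J = lambda u: log_normal_pdf(u, post_mean, jump_sd)
    k, mu = 0, 0.0                                # start in M0 (mu fixed at 0)
    in_m1 = np.empty(n_iter)
    for t in range(n_iter):
        if rng.random() < 0.5:                    # Metropolis step within the current model
            if k == 1:
                prop = mu + rng.normal(0.0, post_sd)
                log_r = log_lik(prop) + log_prior(prop) - log_lik(mu) - log_prior(mu)
                if np.log(rng.random()) < log_r:
                    mu = prop
        elif k == 0:                              # propose M0 -> M1, setting mu = u
            u = rng.normal(post_mean, jump_sd)
            log_r = log_lik(u) + log_prior(u) - log_lik(0.0) - log_J(u)
            if np.log(rng.random()) < log_r:
                k, mu = 1, u
        else:                                     # propose M1 -> M0: reciprocal ratio
            log_r = log_lik(0.0) + log_J(mu) - log_lik(mu) - log_prior(mu)
            if np.log(rng.random()) < log_r:
                k, mu = 0, 0.0
        in_m1[t] = k
    return in_m1.mean()

def exact_prob_m1(y, prior_sd=1.0):
    """Closed-form Pr(M1 | y) for this toy: prior over posterior density at mu = 0."""
    post_mean, post_sd = pilot_posterior(y, prior_sd)
    log_bf = log_normal_pdf(0.0, 0.0, prior_sd) - log_normal_pdf(0.0, post_mean, post_sd)
    return 1.0 / (1.0 + np.exp(-log_bf))

y = np.random.default_rng(42).normal(0.3, 1.0, size=20)   # simulated data (assumption)
print(reversible_jump(y), exact_prob_m1(y))
```

The two printed numbers should agree up to Monte Carlo error. The model-jump proposal probabilities are 0.5 in both directions and so cancel from the ratio, as in the example above.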


Simulated tempering and parallel tempering 


Multimodal distributions can pose special problems for Markov chain simulation. The goal 
is to sample from the entire posterior distribution and this requires sampling from each 
of the modes with significant posterior probability. Unfortunately it is easy for Markov 
chain simulations to remain in the neighborhood of a single mode for a long period of time. 
This occurs primarily when two (or more) modes are separated by regions of extremely 
low posterior density. Then it is difficult to move from one mode to the other because, for 
example, Metropolis jumps to the region between the two modes are rejected. 

Simulated tempering is one strategy for improving Markov chain simulation performance
in this case. As usual, we take $p(\theta|y)$ to be the target density. The algorithm works with a
set of $K+1$ distributions $p_k(\theta|y)$, $k = 0,1,\ldots,K$, where $p_0(\theta|y) = p(\theta|y)$, and $p_1,\ldots,p_K$ are
distributions with the same basic shape but with improved opportunities for mixing across
the modes, and each of these distributions comes with its own sampler (which might, for
example, be a separately tuned Metropolis or HMC algorithm). As usual, the distributions
$p_k$ need not be fully specified; it is only necessary that the user can compute unnormalized
density functions $q_k$, where $q_k(\theta) = p_k(\theta|y)$ multiplied by a constant which can depend on
$y$ and $k$ but not on the parameters $\theta$. (We write $q_k(\theta)$, but with the understanding that,
since the $q_k$'s are built for a particular posterior distribution $p(\theta|y)$, they can in general
depend on $y$.)

One choice for the ladder of unnormalized densities $q_k$ is


$$q_k(\theta) = p(\theta|y)^{1/T_k}\, p_0(\theta)^{1-1/T_k},$$


for a set of ‘temperature’ parameters $T_k > 0$. Setting $T_k = 1$ reduces to the original density,
and large values of $T_k$ produce less highly peaked modes, going in the limit to some ‘base
measure’ $p_0$ that is some convenient prechosen distribution with high variance. (That is,
‘high temperatures’ add ‘thermal noise’ to the system.) A single composite Markov chain
simulation is then developed that randomly moves across the $K+1$ distributions, with $T_0$
set to 1 so that $q_0(\theta) \propto p(\theta|y)$. The state of the composite Markov chain at iteration $t$ is
represented by the pair $(\theta^t, s^t)$, where $s^t$ is an integer identifying the distribution used at
iteration $t$. Each iteration of the composite Markov chain simulation consists of two steps:

1. A new value $\theta^{t+1}$ is selected using the Markov chain simulation with stationary
distribution $q_{s^t}$.

2. A jump from the current sampler $s^t$ to an alternative sampler $j$ is proposed with
probability $J_{s^t,j}$. We accept the move with probability min(r, 1), where

$$r = \frac{c_j\, q_j(\theta^{t+1})\, J_{j,s^t}}{c_{s^t}\, q_{s^t}(\theta^{t+1})\, J_{s^t,j}}.$$

The constants $c_k$ for $k = 0,1,\ldots,K$ are set adaptively (that is, assigned initial values
and then altered after the simulation has run a while) to approximate the inverses of
the normalizing constants for the distributions defined by the unnormalized densities $q_k$.
The chain will then spend an approximately equal amount of time in each sampler.


At the end of the Markov chain simulation, only those values of $\theta$ simulated from the target
distribution ($q_0$) are used to obtain posterior inferences.

Parallel tempering is a variant of the above algorithm in which $K + 1$ parallel chains
are simulated, one for each density $q_k$ in the ladder. Each chain moves on its own but with
occasional flipping of states between chains, with a Metropolis accept-reject rule similar to
that in simulated tempering. At convergence, the simulations from chain 0 represent draws
from the target distribution.
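
As a rough illustration of parallel tempering, the Python sketch below runs four chains on a bimodal one-dimensional target. Everything specific here is an assumption made for the example: the target is an equal mixture of N(-4, 1) and N(4, 1), the ladder is $q_k(\theta) = p(\theta)^{1/T_k}$ (that is, a flat base measure) with temperatures 1, 3, 9, 27, each chain uses a random-walk Metropolis step with scale $\sqrt{T_k}$, and one swap between a randomly chosen adjacent pair is proposed per iteration. The swap is accepted with probability min(r, 1), where $r = q_i(\theta_j)q_j(\theta_i)/\big(q_i(\theta_i)q_j(\theta_j)\big)$.

```python
import numpy as np

def log_p(theta):
    """Unnormalized log target: equal mixture of N(-4, 1) and N(4, 1)."""
    return np.logaddexp(-0.5 * (theta + 4.0) ** 2, -0.5 * (theta - 4.0) ** 2)

def parallel_tempering(n_iter=20000, temps=(1.0, 3.0, 9.0, 27.0), seed=1):
    """Return the draws of chain 0 (temperature 1, the target distribution)."""
    rng = np.random.default_rng(seed)
    temps = np.asarray(temps)
    K = len(temps)
    theta = np.full(K, -4.0)                  # every chain starts in the left mode
    draws = np.empty(n_iter)
    for t in range(n_iter):
        # Metropolis update within each chain; chain k targets q_k = p^(1/T_k)
        prop = theta + rng.normal(0.0, np.sqrt(temps))
        log_r = (log_p(prop) - log_p(theta)) / temps
        accept = np.log(rng.random(K)) < log_r
        theta = np.where(accept, prop, theta)
        # Propose flipping the states of a random adjacent pair of chains
        i = int(rng.integers(K - 1))
        j = i + 1
        log_r_swap = (log_p(theta[j]) - log_p(theta[i])) * (1.0 / temps[i] - 1.0 / temps[j])
        if np.log(rng.random()) < log_r_swap:
            theta[i], theta[j] = theta[j], theta[i]
        draws[t] = theta[0]
    return draws

draws = parallel_tempering()
print("fraction of chain-0 draws in the right-hand mode:", np.mean(draws > 0))
```

With these settings chain 0 should visit both modes about equally often, whereas a single Metropolis chain with unit step size would tend to stay near its starting mode for long stretches.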

Other auxiliary variable methods have been developed that are tailored to particular 
structures of multivariate distributions. For example, highly correlated variables such as 
arise in spatial statistics can be simulated using multigrid sampling, in which computations 
are done alternately on the original scale and on coarser scales that do not capture the local 
details of the target distribution but allow faster movement between states. 


Particle filtering, weighting, and genetic algorithms 


Particle filtering describes a class of simulation algorithms involving parallel chains, in which 
existing chains are periodically tested and allowed to die, live, or split, with the rule set 
up so that chains in lower-probability areas of the posterior distribution are more likely to 
die and those in higher-probability areas are more likely to split. The idea is that a large 
number of chains can explore the parameter space, with the birth/death/splitting steps 
allowing the ensemble of chains to more rapidly converge to the target distribution. The 
probabilities of the different steps are set up so that the stationary distribution of the entire 
process is the posterior distribution of interest. 

A related idea is weighting, in which a simulation is performed that converges to a specified
but wrong distribution, $g(\theta)$, and then the final draws are weighted by $p(\theta|y)/g(\theta)$. In
more sophisticated implementations, this reweighting can be done throughout the simulation
process. It can sometimes be difficult or expensive to sample from $p(\theta|y)$ and faster to
work with a good approximation $g$ if available. Weighting can be combined with particle
filtering by using the weights in the die/live/split probabilities.
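
The weighting step alone can be sketched in a few lines of Python. This is only an illustration: the target (an unnormalized N(1, 0.5²) density), the approximation g = N(0, 2²), and the function names are assumptions chosen so that the answer is known. The weights are normalized to sum to 1, so neither density needs its normalizing constant.

```python
import numpy as np

def log_q(theta):
    """Unnormalized log target density, here N(1, 0.5^2) without its constant."""
    return -0.5 * ((theta - 1.0) / 0.5) ** 2

def log_g(theta):
    """Log density, up to a constant, of the approximation g = N(0, 2^2)."""
    return -0.5 * (theta / 2.0) ** 2

def weighted_mean(n_draws=50000, seed=0):
    """Simulate from g, weight by target/g, and return (posterior mean, effective size)."""
    rng = np.random.default_rng(seed)
    draws = rng.normal(0.0, 2.0, size=n_draws)
    log_w = log_q(draws) - log_g(draws)
    w = np.exp(log_w - log_w.max())       # subtract the max for numerical stability
    w /= w.sum()
    ess = 1.0 / np.sum(w**2)              # a common effective-sample-size summary
    return np.sum(w * draws), ess

mean_est, ess = weighted_mean()
print(mean_est, ess)
```

The effective sample size reported here is one common diagnostic of how well g covers the target; a value far below the number of draws suggests that a few draws dominate the weights.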

Genetic algorithms are similar to particle filtering in having multiple chains that can 
live or die, but with the elaboration that the updating algorithms themselves can change 
(‘mutate’) and combine (‘sexual reproduction’). Many of these ideas are borrowed from 
the numerical analysis literature on optimization but can also be effective in a posterior 
simulation setting in which the goal is to converge to a distribution rather than to a single 
best value. 


12.4 Hamiltonian Monte Carlo 


An inherent inefficiency in the Gibbs sampler and Metropolis algorithm is their random walk 
behavior—as illustrated in Figures 11.1 and 11.2 on pages 276 and 277, the simulations 
can take a long time zigging and zagging while moving through the target distribution. 
Reparameterization and efficient jumping rules can improve the situation (see Sections 12.1 
and 12.2), but for complicated models this local random walk behavior remains, especially 
for high-dimensional target distributions. 

Hamiltonian Monte Carlo (HMC) borrows an idea from physics to suppress the local
random walk behavior in the Metropolis algorithm, thus allowing it to move much more
rapidly through the target distribution. For each component $\theta_j$ in the target space,
Hamiltonian Monte Carlo adds a ‘momentum’ variable $\phi_j$. Both $\theta$ and $\phi$ are then updated together
in a new Metropolis algorithm, in which the jumping distribution for $\theta$ is determined largely
by $\phi$. Each iteration of HMC proceeds via several steps, during which the position and
momentum evolve based on rules imitating the behavior of position; the steps can move rapidly
where possible through the space of $\theta$ and even can turn corners in parameter space to
preserve the total ‘energy’ of the trajectory. Hamiltonian Monte Carlo is also called hybrid
Monte Carlo because it combines MCMC and deterministic simulation methods.

In HMC, the posterior density $p(\theta|y)$ (which, as usual, needs only be computed up
to a multiplicative constant) is augmented by an independent distribution $p(\phi)$ on the
momenta, thus defining a joint distribution, $p(\theta,\phi|y) = p(\phi)p(\theta|y)$. We simulate from the
joint distribution but we are only interested in the simulations of $\theta$; the vector $\phi$ is thus
an auxiliary variable, introduced only to enable the algorithm to move faster through the
parameter space.

In addition to the posterior density (which, as usual, needs to be computed only up
to a multiplicative constant), HMC also requires the gradient of the log-posterior density.
In practice the gradient must be computed analytically; numerical differentiation requires
too many function evaluations to be computationally effective. If $\theta$ has $d$ dimensions, this
gradient is

$$\frac{d\log p(\theta|y)}{d\theta} = \left(\frac{d\log p(\theta|y)}{d\theta_1},\ldots,\frac{d\log p(\theta|y)}{d\theta_d}\right).$$

For most of the models we consider in this book, this vector is easy to determine
analytically and then program. When writing
and debugging the program, we recommend also programming the gradient numerically
(using finite differences of the log-posterior density) as a check on the programming of the
analytic gradients. If the two subroutines do not return identical results to several decimal
places, there is likely a mistake somewhere.
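
A finite-difference check of this kind might look like the following Python sketch. The function names, the step size `h`, and the independent-normal example density are all assumptions made for illustration; central differences are used here, which is one of several reasonable choices.

```python
import numpy as np

def numerical_gradient(log_p, theta, h=1e-4):
    """Central finite-difference approximation to the gradient of log_p at theta."""
    theta = np.asarray(theta, dtype=float)
    grad = np.empty_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = h
        grad[j] = (log_p(theta + step) - log_p(theta - step)) / (2.0 * h)
    return grad

# Hypothetical target: independent normals with these means and scales.
means = np.array([1.0, -2.0, 0.5])
sds = np.array([1.0, 3.0, 0.2])

def log_p(theta):
    """Log density up to an additive constant."""
    return -0.5 * np.sum(((theta - means) / sds) ** 2)

def grad_log_p(theta):
    """Analytic gradient of log_p, the routine being checked."""
    return -(theta - means) / sds**2

theta0 = np.array([0.3, 0.1, 0.4])
max_diff = np.max(np.abs(grad_log_p(theta0) - numerical_gradient(log_p, theta0)))
print("largest discrepancy between analytic and numerical gradient:", max_diff)
```

The numerical routine needs two density evaluations per dimension, which is why it is used only as a debugging check and not inside the sampler.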


The momentum distribution, $p(\phi)$


It is usual to give $\phi$ a multivariate normal distribution (recall that $\phi$ has the same dimension
as $\theta$) with mean 0 and covariance set to a prespecified ‘mass matrix’ $M$ (so called by analogy
to the physical model of Hamiltonian dynamics). To keep it simple, we commonly use a
diagonal mass matrix, $M$. If so, the components of $\phi$ are independent, with $\phi_j \sim N(0, M_{jj})$
for each dimension $j = 1,\ldots,d$. It can be useful for $M$ to roughly scale with the inverse
covariance matrix of the posterior distribution, $(\mathrm{var}(\theta|y))^{-1}$, but the algorithm works in
any case; better scaling of $M$ will merely make HMC more efficient.


The three steps of an HMC iteration 


HMC proceeds by a series of iterations (as in any Metropolis algorithm), with each iteration
having three parts:

1. The iteration begins by updating $\phi$ with a random draw from its posterior distribution
(which, as specified, is the same as its prior distribution), $\phi \sim N(0, M)$.

2. The main part of the Hamiltonian Monte Carlo iteration is a simultaneous update of
$(\theta, \phi)$, conducted in an elaborate but effective fashion via a discrete mimicking of physical
dynamics. This update involves $L$ ‘leapfrog steps’ (to be defined in a moment), each
scaled by a factor $\epsilon$. In a leapfrog step, both $\theta$ and $\phi$ are changed, each in relation to the
other. The $L$ leapfrog steps proceed as follows:

Repeat the following steps $L$ times:

(a) Use the gradient (the vector derivative) of the log-posterior density of $\theta$ to make a
half-step of $\phi$:

$$\phi \leftarrow \phi + \frac{1}{2}\,\epsilon\,\frac{d\log p(\theta|y)}{d\theta}.$$

(b) Use the ‘momentum’ vector $\phi$ to update the ‘position’ vector $\theta$:

$$\theta \leftarrow \theta + \epsilon M^{-1}\phi.$$

Again, $M$ is the mass matrix, the covariance of the momentum distribution $p(\phi)$. If
$M$ is diagonal, the above step amounts to scaling each dimension of the $\theta$ update. (It
might seem redundant to include $\epsilon$ in the above expression: why not simply absorb it
into $M$, which can itself be set by the user? The reason is that it can be convenient
in tuning the algorithm to alter $\epsilon$ while keeping $M$ fixed.)

(c) Again use the gradient of the log-posterior density of $\theta$ to half-update $\phi$:

$$\phi \leftarrow \phi + \frac{1}{2}\,\epsilon\,\frac{d\log p(\theta|y)}{d\theta}.$$

Except at the first and last step, updates (c) and (a) above can be performed together.
The stepping thus starts with a half-step of $\phi$, then alternates $L - 1$ full steps of the
parameter vector $\theta$ and the momentum vector $\phi$, and concludes with a half-step of $\phi$.
This algorithm (called a ‘leapfrog’ because of the splitting of the momentum updates
into half steps) is a discrete approximation to physical Hamiltonian dynamics in which
both position and momentum evolve in continuous time.

In the limit of $\epsilon$ near zero, the leapfrog algorithm preserves the joint density $p(\theta,\phi|y)$.
We will not give the proof, but here is some intuition. Suppose the current value of $\theta$
is at a flat area of the posterior. Then $\frac{d\log p(\theta|y)}{d\theta}$ will be zero, and in step 2 above, the
momentum will remain constant. Thus the leapfrog steps will skate along in $\theta$-space with
constant velocity. Now suppose the algorithm moves toward an area of low posterior
density. Then $\frac{d\log p(\theta|y)}{d\theta}$ will be negative in this direction, thus in step 2 inducing a
decrease in the momentum in the direction of movement. As the leapfrog steps continue
to move into an area of lower density in $\theta$-space, the momentum continues to decrease.
The decrease in $\log p(\theta|y)$ is matched (in the limit $\epsilon \to 0$, exactly so) by a decrease
in the ‘kinetic energy,’ $-\log p(\phi)$. And if iterations continue to move in the direction
of decreasing density, the leapfrog steps will slow to zero and then back down or curve
around the dip. Now consider the algorithm heading in a direction in which the posterior
density is increasing. Then $\frac{d\log p(\theta|y)}{d\theta}$ will be positive in that direction, leading in step 2
to an increase in momentum in that direction. As $\log p(\theta|y)$ increases, $-\log p(\phi)$ increases
correspondingly until the trajectory eventually moves past or around the mode and then
starts to slow down.

For finite $\epsilon$, the joint density $p(\theta,\phi|y)$ does not remain entirely constant during the
leapfrog steps but it will vary only slowly if $\epsilon$ is small. For reasons we do not discuss
here, the leapfrog integrator has the pleasant property that combining $L$ steps of error $\delta$
does not produce $L\delta$ error, because the dynamics of the algorithm tend to send the errors
weaving back and forth around the exact value that would be obtained by a continuous
integration. Keeping the discretization error low is important because of the next part
of the HMC algorithm, the accept/reject step.

3. Label $\theta^{t-1}, \phi^{t-1}$ as the value of the parameter and momentum vectors at the start of the
leapfrog process and $\theta^*, \phi^*$ as the value after the $L$ steps. In the accept-reject step, we
compute

$$r = \frac{p(\theta^*|y)\,p(\phi^*)}{p(\theta^{t-1}|y)\,p(\phi^{t-1})}.$$

4. Set

$$\theta^t = \begin{cases} \theta^* & \text{with probability } \min(r, 1) \\ \theta^{t-1} & \text{otherwise.} \end{cases}$$

Strictly speaking it would be necessary to set $\phi^t$ as well, but since we do not care about
$\phi$ in itself and it gets immediately updated at the beginning of the next iteration (see
step 1 above), there is no need to keep track of it after the accept/reject step.


As with any other MCMC algorithm, we repeat these iterations until approximate
convergence, as assessed by $\hat{R}$ being near 1 and the effective sample size being large enough for
all quantities of interest; see Section 11.4.
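
The steps of an HMC iteration can be put together in a short Python sketch. This is a minimal illustration, assuming a diagonal mass matrix stored as a vector `M_diag` and a hypothetical two-dimensional normal target; it is not the program of Appendix C, and the function names and settings ($\epsilon = 0.15$, $L = 10$) are choices made for the example.

```python
import numpy as np

def hmc_iteration(theta, log_p, grad_log_p, eps, L, M_diag, rng):
    """One HMC iteration; returns the new theta and the acceptance probability."""
    phi = rng.normal(0.0, np.sqrt(M_diag))                   # step 1: phi ~ N(0, M)
    theta_new = theta.copy()
    phi_new = phi + 0.5 * eps * grad_log_p(theta_new)        # opening half-step of phi
    for step in range(L):
        theta_new = theta_new + eps * phi_new / M_diag       # full step of theta
        if step < L - 1:
            phi_new = phi_new + eps * grad_log_p(theta_new)  # (c) and (a) combined
    phi_new = phi_new + 0.5 * eps * grad_log_p(theta_new)    # closing half-step of phi

    def log_joint(th, ph):
        # log p(theta|y) + log p(phi), up to an additive constant
        return log_p(th) - 0.5 * np.sum(ph**2 / M_diag)

    log_r = log_joint(theta_new, phi_new) - log_joint(theta, phi)
    accept_prob = np.exp(min(0.0, log_r))                    # min(r, 1)
    if rng.random() < accept_prob:
        return theta_new, accept_prob
    return theta, accept_prob

# Hypothetical target: independent normals with these means and scales.
means = np.array([1.0, -2.0])
sds = np.array([1.0, 3.0])
log_p = lambda th: -0.5 * np.sum(((th - means) / sds) ** 2)
grad_log_p = lambda th: -(th - means) / sds**2

def run_hmc(n_iter=5000, warmup=500, eps=0.15, L=10, seed=0):
    rng = np.random.default_rng(seed)
    M_diag = 1.0 / sds**2            # mass scaled to the inverse posterior variance
    theta = np.zeros(2)
    draws, probs = [], []
    for t in range(n_iter):
        theta, a = hmc_iteration(theta, log_p, grad_log_p, eps, L, M_diag, rng)
        if t >= warmup:
            draws.append(theta)
            probs.append(a)
    return np.array(draws), float(np.mean(probs))

draws, accept_rate = run_hmc()
print(draws.mean(axis=0), draws.std(axis=0), accept_rate)
```

The sample means and standard deviations should be close to the target values (1, -2) and (1, 3). The momentum is not returned because it is redrawn at the start of the next iteration.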


Restricted parameters and areas of zero posterior density


HMC is designed to work with all-positive target densities. If at any point during an 
iteration the algorithm reaches a point of zero posterior density (for example, if the steps 
go below zero when updating a parameter that is restricted to be positive), we stop the 
stepping and give up, spending another iteration at the previous value of $\theta$. The resulting
algorithm preserves detailed balance and stays in the positive zone. 

An alternative is ‘bouncing,’ where again the algorithm checks that the density is positive 
after each step and, if not, changes the sign of the momentum to return to the direction in 
which it came. This again preserves detailed balance and is typically more efficient than 
simply rejecting the iteration, for example with a hard boundary for a parameter that is 
restricted to be positive. 

Another way to handle bounded parameters is via transformation, for example taking the
logarithm of a parameter constrained to be positive or the logit for a parameter constrained
to fall between 0 and 1, or more complicated joint transformations for sets of parameters
that are constrained (for example, if $\theta_1 < \theta_2 < \theta_3$ or if $\alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 = 1$). One must
then work out the Jacobian of the transformation and use it to determine the log posterior
density and its gradient in the new space.
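
For the simplest case, a single parameter $\theta > 0$ handled through its logarithm, the bookkeeping can be written out as follows. The symbol $\psi = \log\theta$ is notation introduced here only for this illustration.

```latex
\begin{align*}
p(\psi|y) &= p(\theta|y)\left|\frac{d\theta}{d\psi}\right| = p(\theta|y)\,\theta,
  \qquad \theta = e^{\psi}, \\
\log p(\psi|y) &= \log p(\theta|y) + \psi, \\
\frac{d\log p(\psi|y)}{d\psi} &= \theta\,\frac{d\log p(\theta|y)}{d\theta} + 1 .
\end{align*}
```

The added $\psi$ in the second line is the log Jacobian, and the factor $\theta$ in the third line comes from the chain rule; both are needed for the sampler on the $\psi$ scale to target the right distribution.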


Setting the tuning parameters 


HMC can be tuned in three places: (i) the probability distribution for the momentum
variables $\phi$ (which, in our implementation, requires specifying the diagonal elements of a
covariance matrix, that is, a scale parameter for each of the $d$ dimensions of the parameter
vector), (ii) the scaling factor $\epsilon$ of the leapfrog steps, and (iii) the number of leapfrog steps
$L$ per iteration.

As with the Metropolis algorithm in general, these tuning parameters can be set ahead 
of time, or they can be altered completely at random (a strategy which can sometimes be 
helpful in keeping an algorithm from getting stuck), but one has to take care when altering 
them given information from previous iterations. Except in some special cases, adaptive 
updating of the tuning parameters alters the algorithm so that it no longer converges to the 
target distribution. So when we set the tuning parameters, we do so during the warm-up 
period: that is, we start with some initial settings, then run HMC for a while, then reset 
the tuning parameters based on the iterations so far, then discard the early iterations that 
were used for warm-up. This procedure can be repeated if necessary, as long as the saved 
iterations use only simulations after the last setting of the tuning parameters. 

How, then, to set the parameters that govern HMC? We start by setting the scale 
parameters for the momentum variables to some crude estimate of the scale of the target 
distribution. (One can also incorporate covariance information but here we will assume a 
diagonal covariance matrix so that all that is required is the vector of scales.) By default 
we could simply use the identity matrix. 

We then set the product $\epsilon L$ to 1. This roughly calibrates the HMC algorithm to the
‘radius’ of the target distribution; that is, $L$ steps, each of length $\epsilon$ times the already-chosen
scale of $\phi$, should roughly take you from one side of the distribution to the other. A default
starting point could be $\epsilon = 0.1$, $L = 10$.

Finally, theory suggests that HMC is optimally efficient when its acceptance rate is
approximately 65% (based on an analysis similar to that which finds an optimal 23%
acceptance rate for the multidimensional Metropolis algorithm). The theory is based on all
sorts of assumptions but seems like a reasonable guideline for optimization in practice. For
now we recommend a simple adaptation in which HMC is run with its initial settings and then
adapted if the average acceptance probability (as computed from the simulations so far) is
not close to 65%. If the average acceptance probability is lower, then the leapfrog jumps
are too ambitious and you should lower $\epsilon$ and correspondingly increase $L$ (so their product
remains 1). Conversely, if the average acceptance probability is much higher than 65%, then
the steps are too cautious and we recommend raising $\epsilon$ and lowering $L$ (not forgetting that
$L$ must be an integer). These rules do not solve all problems, and it should be possible to
develop diagnostics to assess the efficiency of HMC to allow for more effective adaptation
of the tuning parameters.
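
One way to encode this rule of thumb during warm-up is sketched below in Python. The tolerance band around 65% and the multiplicative factor of 1.5 are arbitrary choices made for illustration, not values recommended in the text, and the function name is hypothetical.

```python
def retune(eps, accept_rate, target=0.65, tol=0.10, factor=1.5):
    """Adjust eps from the average acceptance rate, keeping eps * L near 1.

    tol and factor are illustrative assumptions, not recommended settings.
    """
    if accept_rate < target - tol:
        eps = eps / factor            # jumps too ambitious: shorter steps
    elif accept_rate > target + tol:
        eps = eps * factor            # jumps too cautious: longer steps
    L = max(1, round(1.0 / eps))      # L must be a positive integer
    return eps, L

print(retune(0.1, 0.30))   # low acceptance: smaller eps, larger L
print(retune(0.1, 0.90))   # high acceptance: larger eps, smaller L
print(retune(0.1, 0.65))   # near target: settings unchanged
```

A rule like this should be applied only between warm-up runs, with the iterations used for tuning discarded afterward.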


Varying the tuning parameters during the run 


As with MCMC tuning more generally, any adaptation can go on during the warm-up 
period, but adaptation performed later on, during the simulations that will be used for 
inference, can cause the algorithm to converge to the wrong distribution. For example, 
suppose we were to increase the step size $\epsilon$ after high-probability jumps and decrease $\epsilon$ when
the acceptance probability is low. Such an adaptation seems appealing but would destroy 
the detailed balance (that is, the property of the algorithm that the flow of probability mass 
from point A to B is the same as from B to A, for any points A and B in the posterior 
distribution) that is used to prove that the posterior distribution of interest is the stationary 
distribution of the Markov chain. 

Completely random variation of $\epsilon$ and $L$, however, causes no problems with convergence
and can be useful. If we randomly vary the tuning parameters (within specified ranges) 
from iteration to iteration while the simulation is running, the algorithm has a chance to 
take long tours through the posterior distribution when possible and make short movements 
where the iterations are stuck in a cramped part of the space. The price for this variation 
is some potential loss of optimality, as the algorithm will also take short steps where long 
tours would be feasible and try for long steps where the space is too cramped for such jumps 
to be accepted. 


Locally adaptive HMC 


For difficult HMC problems, it would be desirable for the tuning parameters to vary as 
the algorithm moves through the posterior distribution, with the mass matrix M scaling 
to the local curvature of the log density, the step size $\epsilon$ getting smaller in areas where
the curvature is high, and the number of steps L being large enough for the trajectory 
to move far through the posterior distribution without being so large that the algorithm 
circles around and around. To this end, researchers have developed extensions of HMC that 
adapt without losing detailed balance. These algorithms are more complicated and can 
require more computations per iteration but can converge more effectively for complicated 
distributions. We describe two such algorithms here but without giving the details. 


The no-U-turn sampler. In the no-U-turn sampler, the number of steps is determined 
adaptively at each iteration. Instead of running for a fixed number of steps, L, the trajectory 
in each iteration continues until it turns around (more specifically, until we reach a negative 
value of the dot product between the momentum variable $\phi$ and the distance traveled from
the position $\theta$ at the start of the iteration). This rule essentially sends the trajectory as far
as it can go during that iteration. If such a rule is applied alone, the simulations will not 
converge to the desired target distribution. The full no-U-turn sampler is more complicated, 
going backward and forward along the trajectory in a way that satisfies detailed balance. 
Along with this algorithm comes a procedure for adaptively setting the mass matrix M 
and step size $\epsilon$; these parameters are tuned during the warm-up phase and then held fixed
during the later iterations which are kept for the purpose of posterior inference. 
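
The stopping rule itself (not the full no-U-turn sampler, which needs the extra forward-and-backward machinery mentioned above to preserve detailed balance) can be sketched as follows. The target is a one-dimensional standard normal with unit mass, chosen as an assumption so that the behavior is easy to predict, and the function names are hypothetical.

```python
import numpy as np

def has_turned(theta_start, theta, phi):
    """True when the momentum points back toward the starting position."""
    return float(np.dot(phi, theta - theta_start)) < 0.0

def steps_until_turn(theta_start, phi_start, grad_log_p, eps, max_steps=1000):
    """Count leapfrog steps (unit mass matrix) until the U-turn check first fires."""
    theta = np.array(theta_start, dtype=float)
    phi = np.array(phi_start, dtype=float)
    start = theta.copy()
    for n in range(1, max_steps + 1):
        phi = phi + 0.5 * eps * grad_log_p(theta)
        theta = theta + eps * phi
        phi = phi + 0.5 * eps * grad_log_p(theta)
        if has_turned(start, theta, phi):
            return n
    return max_steps

# Standard normal target: gradient of the log density is -theta.
n_steps = steps_until_turn([0.0], [1.0], lambda th: -th, eps=0.1)
print("leapfrog steps before the trajectory turns around:", n_steps)
```

For this target the exact trajectory is a quarter of an oscillation long when it turns, so with $\epsilon = 0.1$ the count should be close to $(\pi/2)/0.1 \approx 16$.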


Riemannian adaptation. Another approach to optimization is Riemannian adaptation, in
which the mass matrix $M$ is set to conform with the local curvature of the log posterior
density at each step. Again, the local adaptation allows the sampler to move much more
effectively but the steps of the algorithm need to become more complicated to maintain
detailed balance. Riemannian adaptation can be combined with the no-U-turn sampler.

Neither of the above extensions solves all the problems with HMC. The no-U-turn sam- 
pler is self-tuning and computationally efficient but, like ordinary Hamiltonian Monte Carlo, 
has difficulties with very short-tailed and long-tailed distributions, in both cases having dif- 
ficulties transitioning from the center to the tails, even in one dimension. Riemannian 
adaptation handles varying curvature and non-exponentially tailed distributions but is im- 
practical in high dimensions. 


Combining HMC with Gibbs sampling 


There are two ways in which ideas of the Gibbs sampler fit into Hamiltonian Monte Carlo.
First, it can make sense to partition variables into blocks, either to simplify computation
or to speed convergence. Consider a hierarchical model with $J$ groups, with parameter
vector $\theta = (\eta^{(1)},\eta^{(2)},\ldots,\eta^{(J)},\phi)$, where each of the $\eta^{(j)}$'s is itself a vector of parameters
corresponding to the model for group $j$ and $\phi$ is a vector of hyperparameters, and for which
the posterior distribution can be factored as $p(\theta|y) \propto p(\phi)\prod_{j=1}^{J} p(\eta^{(j)}|\phi)\,p(y^{(j)}|\eta^{(j)})$. In
this case, even if it is possible to update the entire vector $\theta$ at once using HMC, it may
be more effective (in computation speed or convergence) to cycle through $J + 1$ updating
steps, altering each $\eta^{(j)}$ and then $\phi$ during each cycle. This way we only have to work with
at most one of the likelihood factors, $p(y^{(j)}|\eta^{(j)})$, at each step. Parameter expansion can
be used to facilitate quicker mixing through the joint distribution.

The second way in which Gibbs sampler principles can enter HMC is through the updat- 
ing of discrete variables. Hamiltonian dynamics are only defined on continuous distributions. 
If some of the parameters in a model are defined on discrete spaces (for example, latent in- 
dicators for mixture components, or a parameter that follows a continuous distribution but 
has a positive probability of being exactly zero), they can be updated using Gibbs steps or, 
more generally, one-dimensional updates such as Metropolis or slice sampling (see Section 
12.3). The simplest approach is to partition the space into discrete and continuous param- 
eters, then alternate HMC updates on the continuous subspace and Gibbs, Metropolis, or 
slice updates on the discrete components. 


12.5 Hamiltonian Monte Carlo for a hierarchical model 


We illustrate the tuning of Hamiltonian Monte Carlo with the model for the educational 
testing experiments described in Chapter 5. HMC is not necessary in this problem—the 
Gibbs sampler works just fine, especially after the parameter expansion which allows more 
efficient movement of the hierarchical variance parameter (see Section 12.1)—but it is helpful 
to understand the new algorithm in a simple example. Here we go through all the steps of 
the algorithm. The code appears in Section C.4, starting on page 603. 

In order not to overload our notation, we label the eight school effects (defined as $\theta_j$ in
Chapter 5) as $\alpha_j$; the full vector of parameters $\theta$ then has $d = 10$ dimensions, corresponding
to $\alpha_1,\ldots,\alpha_8,\mu,\tau$.


Gradients of the log posterior density. For HMC we need the gradients of the log posterior
density for each of the ten parameters, a set of operations that are easily performed with
the normal distributions of this model:

$$\frac{d\log p(\theta|y)}{d\alpha_j} = -\frac{\alpha_j - y_j}{\sigma_j^2} - \frac{\alpha_j - \mu}{\tau^2}, \quad \text{for } j = 1,\ldots,8,$$

$$\frac{d\log p(\theta|y)}{d\mu} = -\sum_{j=1}^{J}\frac{\mu - \alpha_j}{\tau^2},$$

$$\frac{d\log p(\theta|y)}{d\tau} = -\frac{J}{\tau} + \sum_{j=1}^{J}\frac{(\mu - \alpha_j)^2}{\tau^3}.$$

As a debugging step we also compute the gradients numerically using finite differences of
$\pm 0.0001$ on each component of $\theta$. Once we have checked that the two gradient routines yield
identical results, we use the analytic gradient in the algorithm as it is faster to compute.
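
The gradient formulas and the finite-difference check can be sketched in Python as follows. This is an illustration and not the code of Section C.4: the function names are hypothetical, the data are the eight estimates and standard errors of the educational testing example as they are commonly reproduced from Table 5.2, and the log posterior assumes a uniform prior density on $(\mu, \tau)$, which is what the $-J/\tau$ term in the gradient corresponds to.

```python
import numpy as np

# Eight schools data (estimated effects and standard errors), as in Table 5.2.
y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])
J = len(y)

def log_post(th):
    """Log posterior up to a constant; th = (alpha_1..alpha_8, mu, tau)."""
    alpha, mu, tau = th[:J], th[J], th[J + 1]
    if tau <= 0:
        return -np.inf
    return (-0.5 * np.sum(((y - alpha) / sigma) ** 2)
            - J * np.log(tau)
            - 0.5 * np.sum(((alpha - mu) / tau) ** 2))

def grad_log_post(th):
    """Analytic gradient, matching the three displayed formulas."""
    alpha, mu, tau = th[:J], th[J], th[J + 1]
    d_alpha = -(alpha - y) / sigma**2 - (alpha - mu) / tau**2
    d_mu = -np.sum(mu - alpha) / tau**2
    d_tau = -J / tau + np.sum((mu - alpha) ** 2) / tau**3
    return np.concatenate([d_alpha, [d_mu, d_tau]])

def numerical_grad(f, th, h=1e-4):
    """Finite differences of +/- h on each component."""
    g = np.empty_like(th)
    for j in range(th.size):
        e = np.zeros_like(th)
        e[j] = h
        g[j] = (f(th + e) - f(th - e)) / (2.0 * h)
    return g

th0 = np.concatenate([0.5 * y, [5.0, 10.0]])      # an arbitrary test point
discrepancy = np.max(np.abs(grad_log_post(th0) - numerical_grad(log_post, th0)))
print("largest analytic-vs-numerical discrepancy:", discrepancy)
```

A discrepancy far above rounding level at several test points would indicate a mistake in one of the two routines.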


The mass matrix for the momentum distribution. As noted above, we want to scale the
mass matrix to roughly match the posterior distribution. That said, we typically only have
a vague idea of the posterior scale before beginning our computation; thus this scaling is
primarily intended to forestall the problems that would arise if there are gross disparities
in the scaling of different dimensions. In this case, after looking at the data in Table 5.2 we
assign a rough scale of 15 for each of the parameters in the model and crudely set the mass
matrix to $\mathrm{Diag}(15^{-2},\ldots,15^{-2})$.


Starting values. We run 4 chains of HMC with starting values drawn at random to crudely
match the scale of the parameter space, in this case following the idea above and drawing
the ten parameters in the model from independent $N(0, 15^2)$ distributions.


Tuning $\epsilon$ and $L$. To give the algorithm more flexibility, we do not set $\epsilon$ and $L$ to fixed values.
Instead we choose central values $\epsilon_0, L_0$ and then at each step draw $\epsilon$ and $L$ independently
from uniform distributions on $(0, 2\epsilon_0)$ and $[1, 2L_0]$, respectively (with the distribution for $L$
being discrete uniform, as $L$ must be an integer). We have no reason to think this particular
jittering is ideal; it is just a simple way to vary the tuning parameters in a way that does
not interfere with convergence of the algorithm. Following the general advice given above,
we start by setting $\epsilon_0 L_0 = 1$ and $L_0 = 10$. We simulate 4 chains for 20 iterations just to
check that the program runs without crashing.
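
The jittering of the tuning parameters takes only a few lines. The following Python sketch uses a hypothetical function name and NumPy's random generator; the draws do not depend on the past of the chain, which is what keeps this scheme from interfering with convergence.

```python
import numpy as np

def draw_tuning(eps0, L0, rng):
    """Draw eps ~ Uniform(0, 2*eps0) and L ~ discrete Uniform{1, ..., 2*L0}."""
    eps = rng.uniform(0.0, 2.0 * eps0)
    L = int(rng.integers(1, 2 * L0 + 1))    # upper bound of integers() is exclusive
    return eps, L

rng = np.random.default_rng(0)
for _ in range(3):
    print(draw_tuning(0.1, 10, rng))
```

Each HMC iteration would call this once and pass the resulting pair to the leapfrog routine.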

We then do some experimentation. We first run 4 chains for 100 iterations and see that
the inferences are reasonable (no extreme values, as can sometimes happen when there is
poor convergence or a bug in the program) but not yet close to convergence, with several
values of $\hat{R}$ that are more than 2. The average acceptance probabilities of the 4 chains are
0.23, 0.59, 0.02, and 0.57, well below 65%, so we suspect the step size is too large.

We decrease $\epsilon_0$ to 0.05, increase $L_0$ to 20 (thus keeping $\epsilon_0 L_0$ constant), and rerun the 4
chains for 100 iterations, now getting acceptance rates of 0.72, 0.87, 0.33, and 0.55, with
chains still far from mixing. At this point we increase the number of simulations to 1000.
The simulations now are close to convergence, with $\hat{R}$ less than 1.2 for all parameters, and
average acceptance probabilities are more stable, at 0.52, 0.68, 0.75, and 0.51. We then
run 4 chains at 10,000 simulations at these tuning parameters and achieve approximate
convergence, with $\hat{R}$ less than 1.1 for all parameters.

In this particular example, HMC is unnecessary, as the Gibbs sampler works fine on
an appropriately transformed scale. In larger and more difficult problems, however, Gibbs
and Metropolis can be too slow, while HMC can move efficiently through
high-dimensional parameter spaces.


Transforming to log T 


When running HMC on a model with constrained parameters, the algorithm can go outside 
the boundary, thus wasting some iterations. One remedy is to transform the space to be 
unconstrained. In this case, the simplest way to handle the constraint T >0 is to transform 
to logr. We then must alter the algorithm in the following ways: 
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1. We redefine 0 as (a1,...,@8, u, log7) and do all jumping on this new space. 


2. The (unnormalized) posterior density p(@|y) is multiplied by the Jacobian, 7, so we add 
log7 to the log posterior density used in the calculations. 


3. The gradient of the log posterior density changes in two ways: first, we need to account
for the new term added just above; second, the derivative for the last component of the
gradient is now with respect to log τ rather than τ and so must be multiplied by the
Jacobian, τ:

$$\frac{d \log p(\theta|y)}{d \log \tau} = -(J-1) + \frac{\sum_{j=1}^{J} (\mu - \alpha_j)^2}{\tau^2}.$$

4. We change the mass matrix to account for the transformation. We keep α_1, ..., α_8, μ
with masses of 15 (roughly corresponding to a posterior distribution with a scale of 15
in each of these dimensions) but set the mass of log τ to 1.


5. We correspondingly change the initial values by drawing the first nine parameters from
independent N(0, 15²) distributions and log τ from N(0, 1).


HMC runs as before. Again, we start with ε = 0.1 and L = 10 and then adjust to get a
reasonable acceptance rate. 
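
The changes listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data values Y and SIGMA are the usual 8-schools estimates and standard errors (they are not given in this section), the model is taken to be y_j ~ N(α_j, σ_j²), α_j ~ N(μ, τ²) with a uniform prior on (μ, τ), and the function names are hypothetical. The finite-difference helper is there only to check the analytic gradient on the log τ scale.

```python
import math

# Assumed 8-schools data (not stated in this section): estimates and standard errors.
Y = [28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0]
SIGMA = [15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0]
J = len(Y)

def log_post(phi):
    """Unnormalized log posterior for phi = (alpha_1..alpha_J, mu, log tau)."""
    alpha, mu, log_tau = phi[:J], phi[J], phi[J + 1]
    tau = math.exp(log_tau)
    log_lik = -0.5 * sum(((y - a) / s) ** 2 for y, a, s in zip(Y, alpha, SIGMA))
    log_prior = -J * log_tau - 0.5 * sum(((a - mu) / tau) ** 2 for a in alpha)
    return log_lik + log_prior + log_tau  # last term: log of the Jacobian, tau

def grad_log_post(phi):
    """Analytic gradient with respect to (alpha_1..alpha_J, mu, log tau)."""
    alpha, mu, log_tau = phi[:J], phi[J], phi[J + 1]
    tau2 = math.exp(2.0 * log_tau)
    d_alpha = [(y - a) / s ** 2 - (a - mu) / tau2 for y, a, s in zip(Y, alpha, SIGMA)]
    d_mu = sum(a - mu for a in alpha) / tau2
    d_log_tau = -(J - 1) + sum((mu - a) ** 2 for a in alpha) / tau2
    return d_alpha + [d_mu, d_log_tau]

def numerical_grad(f, phi, delta=1e-5):
    """Central finite differences, used only to check the analytic gradient."""
    grad = []
    for i in range(len(phi)):
        up, dn = list(phi), list(phi)
        up[i] += delta
        dn[i] -= delta
        grad.append((f(up) - f(dn)) / (2.0 * delta))
    return grad
```

Comparing grad_log_post with numerical_grad at a few arbitrary points is a cheap way to catch a forgotten Jacobian term before running any chains.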


12.6 Stan: developing a computing environment 


Hamiltonian Monte Carlo takes a bit of effort to program and tune. In more complicated 
settings, though, we have found HMC to be faster and more reliable than basic Markov 
chain simulation algorithms. 

To mitigate the challenges of programming and tuning, we have developed a computer 
program, Stan (Sampling through adaptive neighborhoods) to automatically apply HMC 
given a Bayesian model. The key steps of the algorithm are data and model input, compu- 
tation of the log posterior density (up to an arbitrary constant that cannot depend on the 
parameters in the model) and its gradients, a warm-up phase in which the tuning param- 
eters are set, an implementation of the no-U-turn sampler to move through the parameter 
space, and convergence monitoring and inferential summaries at the end. 

We briefly describe how each of these steps is done in Stan. Instructions and examples 
for running the program appear in Appendix C. 


Entering the data and model 


Each line of a Stan model goes into defining the log probability density of the data and 
parameters, with code for looping, conditioning, computation of intermediate quantities, 
and specification of terms of the log joint density. Standard distributions such as the normal, 
gamma, binomial, Poisson, and so forth, are preprogrammed, and arbitrary distributions 
can be entered by directly programming the log density. Algebraic manipulations and 
functions such as exp and logit can also be included in the specification; it is all just sent 
into C++. 

To compute gradients, Stan uses automatic analytic differentiation, using an algorithm 
that parses arbitrary C++ expressions and then applies basic rules of differential calculus 
to construct a C++ program for the gradient. For computational efficiency, we have pre- 
programmed the gradients for various standard statistical expressions to make up some of 
this difference. We use special scalar variable classes that evaluate the function and at the 
same time construct the full expression tree used to generate the log probability. Then the 
reverse pass walks backward down the expression tree (visiting every dependent node before 
any node it depends on), propagating partial derivatives by the chain rule. The walk over 
the expression tree implicitly employs dynamic programming to minimize the number of 
calculations. The resulting autodifferentiation is typically much faster than computing the
gradient numerically via finite differences. 
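
To make the reverse pass concrete, here is a toy reverse-mode differentiator in Python. It is not Stan's implementation (which is written in C++ and is far more elaborate); the class Var and the functions exp and backward are hypothetical names chosen for this sketch. Each arithmetic operation records its inputs and local partial derivatives, and backward then visits every node before the nodes it depends on, accumulating derivatives by the chain rule.

```python
import math

class Var:
    """Scalar that records the expression tree as it is evaluated."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs (input Var, partial derivative of self w.r.t. it)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __radd__ = __add__
    __rmul__ = __mul__

def exp(x):
    v = math.exp(x.value)
    return Var(v, ((x, v),))

def backward(output):
    """Fill in .grad for every node that output depends on."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)  # appended after everything it depends on

    visit(output)
    output.grad = 1.0
    # Reverse pass: a node's derivative is complete before it is pushed to its inputs.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad
```

For example, with x = Var(1.5), y = Var(-2.0), and f = x * y + exp(x), calling backward(f) leaves the partial derivatives of f in x.grad and y.grad after a single sweep, however many inputs there are.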

In addition to the data, parameters, and model statements, a Stan call also needs the 
number of chains, the number of iterations per chain, and various control parameters that 
can be set by default. Starting values can be supplied or else they are generated from preset 
default random variables. 


Setting tuning parameters in the warm-up phase 


As noted above, it can be tricky to tune Hamiltonian Monte Carlo for any particular ex- 
ample. The no-U-turn sampler helps with this, as it eliminates the need to assign the 
number of steps L, but we still need to set the mass matrix M and step size ε. During a
prespecified warm-up phase of the simulation, Stan adaptively alters M and ε using ideas
from stochastic optimization in numerical analysis. This adaptation will not always work— 
for distributions with varying curvature, there will not in general be any single good set 
of tuning parameters—and if the simulation is having difficulty converging, it can make 
sense to look at the values of M and ε chosen for different chains to better understand
what is happening. Convergence can sometimes be improved by reparameterization. More 
generally, it could make sense to have different tuning parameters for different areas of the 
distribution—this is related to ideas such as Riemannian adaptation, which at the time of 
this writing we are incorporating into Stan. 


No-U-turn sampler 


Stan runs HMC using the no-U-turn sampler, preprocessing where possible by transforming 
bounded variables to put them on an unconstrained scale. For complicated constraints this 
cannot always be done automatically and then it can make sense for the user to reparame- 
terize in writing the model. While running, Stan keeps track of acceptance probabilities (as 
well as the simulations themselves), which can be helpful in getting inside the algorithm if 
there are problems with mixing of the chains. 


Inferences and postprocessing 


Stan produces multiple sequences of simulations. For our posterior inferences we discard 
the iterations from the warm-up period (but we save them as possibly of diagnostic use if 
the algorithm is not mixing well) and compute R̂ and n_eff as described in Section 11.4.
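
As a rough illustration of the first of these summaries, the following Python sketch computes the potential scale reduction for a single scalar estimand from several chains, splitting each chain in half first. It assumes the usual between- and within-sequence variance definitions; the function name split_rhat is hypothetical, this is not Stan's own code, and the effective sample size is not covered here.

```python
import random
import statistics

def split_rhat(chains):
    """Potential scale reduction for one scalar estimand.

    chains: list of equal-length lists of draws, warm-up already discarded.
    """
    half = len(chains[0]) // 2
    # Split each chain in two so that drift within a chain also shows up.
    halves = [c[:half] for c in chains] + [c[half:2 * half] for c in chains]
    n = half
    means = [statistics.fmean(h) for h in halves]
    between = n * statistics.variance(means)                        # B
    within = statistics.fmean([statistics.variance(h) for h in halves])  # W
    var_plus = (n - 1) / n * within + between / n
    return (var_plus / within) ** 0.5
```

Values near 1 suggest that the half-chains are sampling the same distribution; chains stuck in different regions inflate the between-sequence variance and push the ratio well above 1.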


12.7 Bibliographic note 


For the relatively simple ways of improving simulation algorithms mentioned in Sections 
12.1 and 12.2, Tanner and Wong (1987) discuss data augmentation and auxiliary variables, 
and Hills and Smith (1992) and Roberts and Sahu (1997) discuss different parameterizations 
for the Gibbs sampler. Higdon (1998) discusses some more complicated auxiliary variable 
methods, and Liu and Wu (1999), van Dyk and Meng (2001), and Liu (2003) present 
different approaches to parameter expansion. The results on acceptance rates for efficient 
Metropolis jumping rules appear in Gelman, Roberts, and Gilks (1995); more general results 
for Metropolis-Hastings algorithms appear in Roberts and Rosenthal (2001) and Brooks, 
Giudici, and Roberts (2003). 

Gelfand and Sahu (1994) discuss the difficulties of maintaining convergence to the target 
distribution when adapting Markov chain simulations, as discussed at the end of Section 
12.2. Andrieu and Robert (2001) and Andrieu and Thoms (2008) consider adaptive Markov 
chain Monte Carlo algorithms. Peltola, Marttinen, and Vehtari (2012) present an adaptive
multistep Metropolis-Hastings algorithm for variable selection in linear models.

Slice sampling is discussed by Neal (2003), and simulated tempering is discussed by 
Geyer and Thompson (1993) and Neal (1996b). Besag et al. (1995) and Higdon (1998) 
review several ideas based on auxiliary variables that have been useful in high-dimensional 
problems arising in genetics and spatial models. 

Reversible jump MCMC was introduced by Green (1995); see also Richardson and Green 
(1997) and Brooks, Giudici, and Roberts (2003) for more on trans-dimensional MCMC. 

Mykland, Tierney, and Yu (1994) discuss an approach to MCMC in which the algorithm 
has regeneration points, or subspaces of θ, so that if a finite sequence starts and ends at a
regeneration point, it can be considered as an exact (although dependent) sample from the 
target distribution. Propp and Wilson (1996) and Fill (1998) introduce a class of MCMC 
algorithms called perfect simulation in which, after a certain number of iterations, the 
simulations are known to have exactly converged to the target distribution. 

The book by Liu (2001) covers a wide range of advanced simulation algorithms includ- 
ing those discussed in this chapter. The monograph by Neal (1993) also overviews many 
of these methods. Hamiltonian Monte Carlo was introduced by Duane et al. (1987) in 
the physics literature and Neal (1994) for statistics problems. Neal (2011) reviews HMC, 
Hoffman and Gelman (2014) introduce the no-U-turn sampler, and Girolami and Calder- 
head (2011) introduce Riemannian updating; see also Betancourt and Stein (2011) and 
Betancourt (2013ab). Romeel (2011) explains how leapfrog steps tend to reduce discretiza- 
tion error in HMC. Leimkuhler and Reich (2004) discuss the mathematics in more detail. 
Griewank and Walther (2008) is a standard reference on algorithmic differentiation. 


12.8 Exercises 


1. Efficient Metropolis jumping rules: Repeat the computation for Exercise 11.2 using the 
adaptive algorithm given in Section 12.2. 

2. Simulated tempering: Consider the Cauchy model, y_i ~ Cauchy(θ, 1), i = 1, ..., n, with
uniform prior on θ, and two data points, y_1 = 1.3, y_2 = 15.0.

(a) Graph the posterior density. 

(b) Program the Metropolis algorithm for this problem using a symmetric Cauchy jumping 
distribution. Tune the scale parameter of the jumping distribution appropriately. 

(c) Program simulated tempering with a ladder of 10 inverse-temperatures, 0.1, ..., 1. 

(d) Compare your answers in (b) and (c) to the graph in (a). 

3. Hamiltonian Monte Carlo: Program HMC in R for the bioassay logistic regression ex- 
ample from Chapter 3. 

(a) Code the gradients analytically and numerically and check that the two programs give 
the same result. 

(b) Pick reasonable starting values for the mass matrix, step size, and number of steps. 

(c) Tune the algorithm to an approximate 65% acceptance rate. 

(d) Run 4 chains long enough so that each has an effective sample size of at least 100. 
How many iterations did you need? 

(e) Check that your inferences are consistent with those from the direct approach in 
Chapter 3. 

4. Coverage of intervals and rejection sampling: Consider the following model: y_j ~
Binomial(n_j, θ_j), where θ_j = logit^{-1}(α + βx_j), for j = 1, ..., J, and with independent
prior distributions, α ~ t_4(0, 2²) and β ~ t_4(0, 1). Assume J = 10, the x_j values
are randomly drawn from a U(1, 1) distribution, and n_j ~ Poisson⁺(5), where Poisson⁺
is the Poisson distribution restricted to positive values.

(a) Sample a dataset at random from the model, estimate α and β using Stan, and make
a graph simultaneously displaying the data, the fitted model, and uncertainty in the
fit (shown via a set of inverse logit curves that are thin and gray (in R, lwd=.5,
col="gray")).

(b) Did Stan’s posterior 50% interval for α contain its true value? How about β?

(c) Use rejection sampling to get 1000 independent posterior draws from (α, β).
Chapter 13


Modal and distributional approximations 


The early chapters of the book describe simulation approaches that work in low-dimensional 
problems. With complicated models, it is rare that samples from the posterior distribution 
can be obtained directly, and Chapters 11 and 12 describe iterative simulation algorithms 
that can be used with such models. In this chapter we describe various approaches based 
on distributional approximations. These methods are useful for quick inferences, as starting 
points for Markov chain simulation algorithms, and for large problems where iterative sim- 
ulation approaches are too slow. The approximations that we describe are relatively simple 
to compute and can provide valuable information about the fit of the model. 


In Section 13.1 we discuss algorithms for finding posterior modes. Beyond being useful 
in constructing distributional approximations, the posterior mode is often used in statistical 
practice as a point estimate, sometimes in the guise of a penalized likelihood estimate (where 
the logarithm of the prior density is considered as a penalty function). Section 13.2 discusses 
how, if the goal is to summarize the posterior distribution by a mode, it can make sense to 
use a different prior distribution than would be used in full Bayesian inference. Section 13.3 
presents normal and normal-mixture approximations centered at the mode. We continue in 
Sections 13.4-13.6 with EM (expectation maximization) and related approaches for finding 
marginal posterior modes, along with related approximations based on factorizing the joint 
posterior distribution. Finally, Sections 13.7 and 13.8 introduce variational Bayes and 
expectation propagation, two methods for constructing approximations to a distribution 
based on conditional moments. 

The proliferation of algorithms for Bayesian computing reflects the proliferation of chal- 
lenging applied problems that we are trying to solve using Bayesian methods. These prob- 
lems typically have large numbers of unknown parameters, hence the appeal of Bayesian 
inference and hence also the struggles with computation. When different approximating 
strategies are available, it can make sense to fit the model in multiple ways and then use 
the tools described in Chapters 6 and 7 to evaluate and compare them. 


13.1 Finding posterior modes 


In Bayesian computation, we search for modes not for their own sake, but as a way to 
begin mapping the posterior density. In particular, we have no special interest in finding 
the absolute maximum of the posterior density. If many modes exist, we should try to find 
them all, or at least all the modes with non-negligible posterior mass in their neighborhoods. 
In practice, we often first search for a single mode, and if it does not look reasonable in 
a substantive sense, we continue searching through the parameter space for other modes. 
To find all the local modes—or to make sure that a mode that has been found is the only 
important mode—sometimes one must run a mode-finding algorithm several times from 
different starting points. 


Even better, where possible, is to find the modes of the marginal posterior density of a 
subset of the parameters. One then analyzes the distribution of the remaining parameters,
conditional on the first subset. We return to this topic in Sections 13.4 and 13.5. 

A variety of numerical methods exist for solving optimization problems and any of these, 
in principle, can be applied to find the modes of a posterior density. Rather than attempt to 
cover this vast topic comprehensively, we introduce two simple methods that are commonly 
used in statistical problems. 


Conditional maximization 


Often the simplest method of finding modes is conditional maximization, also called stepwise
ascent: start somewhere in the target distribution (for example, setting the parameters
at rough estimates) and then alter one set of components of θ at a time, leaving
the other components at their previous values, at each step increasing the log posterior
density. Assuming the posterior density is bounded, the steps will eventually converge to a
local mode. The method of iterative proportional fitting for loglinear models (see Section 
16.7) is an example of conditional maximization. To search for multiple modes, run the 
conditional maximization routine starting at a variety of points spread throughout the pa- 
rameter space. It should be possible to find a range of reasonable starting points based on 
rough estimates of the parameters and problem-specific knowledge about reasonable bounds 
on the parameters. 

For many standard statistical models, the conditional distribution of each parameter 
given all the others has a simple analytic form and is easily maximized. In this case, 
applying a conditional maximization algorithm is easy: just maximize the density with 
respect to one set of parameters at a time, iterating until the steps become small enough 
that approximate convergence has been reached. We illustrate this process in Section 13.6 
for the example of the hierarchical normal model. 
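
A minimal Python sketch of conditional maximization follows, using a toy target that is not from the text: a bivariate normal log density with assumed center (m1, m2) and correlation rho, for which each conditional maximum is available in closed form. The function names and default values are hypothetical.

```python
def log_post(t1, t2, m1=1.0, m2=-2.0, rho=0.9):
    """Toy log density: bivariate normal with unit variances and correlation rho."""
    d1, d2 = t1 - m1, t2 - m2
    return -(d1 * d1 - 2 * rho * d1 * d2 + d2 * d2) / (2 * (1 - rho * rho))

def conditional_maximization(t1, t2, m1=1.0, m2=-2.0, rho=0.9,
                             tol=1e-10, max_sweeps=10_000):
    """Alternate closed-form conditional maximizations until the steps are tiny."""
    for _ in range(max_sweeps):
        old = (t1, t2)
        t1 = m1 + rho * (t2 - m2)  # maximizes over t1 with t2 held fixed
        t2 = m2 + rho * (t1 - m1)  # maximizes over t2 with the new t1 held fixed
        if max(abs(t1 - old[0]), abs(t2 - old[1])) < tol:
            break
    return t1, t2
```

With strongly correlated components (rho close to 1) each sweep makes only a small move along the ridge, which is one reason the method can be slow even though every individual step is easy.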


Newton’s method 


Newton’s method, also called the Newton—Raphson algorithm, is an iterative approach based 
on a quadratic Taylor series approximation of the log posterior density, 


L(θ) = log p(θ|y).


It is also acceptable to use an unnormalized posterior density, since Newton’s method uses 
only the derivatives of L(θ), and any multiplicative constant in p is an additive constant in
L. As we have seen in Chapter 4, the quadratic approximation is generally fairly accurate
when the number of data points is large relative to the number of parameters. Start by
determining the functions L'(θ) and L''(θ), the vector of derivatives and matrix of second
derivatives, respectively, of the logarithm of the posterior density. The derivatives can be 
determined analytically or numerically. The mode-finding algorithm proceeds as follows: 


1. Choose a starting value, θ^0.
2. For t = 1, 2, 3, ...,
(a) Compute L'(θ^{t−1}) and L''(θ^{t−1}). The Newton’s method step at time t is based on the
quadratic approximation to L(θ) centered at θ^{t−1}.
(b) Set the new iterate, θ^t, to maximize the quadratic approximation; thus,


$$\theta^t = \theta^{t-1} - [L''(\theta^{t-1})]^{-1} L'(\theta^{t-1}).$$


The starting value, θ^0, is important; the algorithm is not guaranteed to converge from all
starting values, particularly in regions where −L'' is not positive definite. Starting values
may be obtained from crude parameter estimates, or conditional maximization could be
used to generate a starting value for Newton’s method. The advantage of Newton’s method
is that convergence is extremely fast once the iterates are close to the solution, where the 
quadratic approximation is accurate. If the iterations do not converge, they typically move 
off quickly toward the edge of the parameter space, and the next step may be to try again 
with a new starting point. 
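
The iteration can be sketched in Python for a one-parameter case. The example is an assumption made for illustration, not one from this section: a binomial likelihood with y successes in n trials, parameterized by θ = logit(p) with a flat prior on θ, so that L(θ) = yθ − n log(1 + e^θ). The function name newton_mode is hypothetical.

```python
import math

def newton_mode(y, n, theta=0.0, tol=1e-12, max_iter=50):
    """Mode of L(theta) = y*theta - n*log(1 + exp(theta)) by Newton's method."""
    for _ in range(max_iter):
        p = 1.0 / (1.0 + math.exp(-theta))
        first = y - n * p              # L'(theta)
        second = -n * p * (1.0 - p)    # L''(theta), negative for every theta here
        step = first / second
        theta -= step                  # theta^t = theta^{t-1} - L'/L''
        if abs(step) < tol:
            break
    return theta
```

Because −L'' is positive everywhere in this example, the iteration is well behaved from any moderate starting value; in problems where that fails, the safeguards discussed above (better starting values, restarting) become relevant.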


Quasi-Newton and conjugate gradient methods 


Computation and storage of −L'' in Newton’s method may be costly. Quasi-Newton methods
such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method form an approximation
of −L'' iteratively using only gradient information.

Conjugate gradient methods use only the gradient information but, instead of steepest 
descent, subsequent optimization directions are formed using conjugate direction formulas. 
Conjugate gradient is likely to require more iterations than Newton and quasi-Newton 
methods but uses less computation per iteration and requires less storage. 


Numerical computation of derivatives 


If the first and second derivatives of the log posterior density are difficult to determine analytically,
one can approximate them numerically using finite differences. Each component
of L' can be estimated numerically at any specified value θ = (θ_1, ..., θ_d) by

$$L'_i(\theta) = \frac{dL}{d\theta_i} \approx \frac{L(\theta + \delta_i e_i | y) - L(\theta - \delta_i e_i | y)}{2\delta_i}, \qquad (13.1)$$

where δ_i is a small value and, using linear algebra notation, e_i is the unit vector corresponding
to the ith component of θ. The values of δ_i are chosen based on the scale of the
problem; typically, values such as 0.0001 are low enough to approximate the derivative and
high enough to avoid roundoff error on the computer. The second derivative matrix at θ is
numerically estimated by applying the differencing again; for each i, j:


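$$\begin{aligned}
L''_{ij}(\theta) = \frac{d^2 L}{d\theta_i\, d\theta_j} &= \frac{d}{d\theta_j}\left(\frac{dL}{d\theta_i}\right) \\
&\approx \frac{L'_i(\theta + \delta_j e_j | y) - L'_i(\theta - \delta_j e_j | y)}{2\delta_j} \\
&\approx \big[ L(\theta + \delta_i e_i + \delta_j e_j) - L(\theta - \delta_i e_i + \delta_j e_j) \\
&\qquad - L(\theta + \delta_i e_i - \delta_j e_j) + L(\theta - \delta_i e_i - \delta_j e_j) \big] / (4 \delta_i \delta_j). \qquad (13.2)
\end{aligned}$$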
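
The two differencing formulas translate directly into code. The Python sketch below uses a single step size delta for every component, which is a simplifying assumption; the function names fd_gradient and fd_hessian are hypothetical, and L is any function that takes a list of parameter values and returns the log posterior density.

```python
def fd_gradient(L, theta, delta=1e-4):
    """Central-difference estimate of the gradient of L at theta."""
    grad = []
    for i in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[i] += delta
        dn[i] -= delta
        grad.append((L(up) - L(dn)) / (2 * delta))
    return grad

def fd_hessian(L, theta, delta=1e-4):
    """Second-derivative matrix from four evaluations of L per entry."""
    d = len(theta)
    H = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            vals = []
            for si, sj in ((1, 1), (-1, 1), (1, -1), (-1, -1)):
                t = list(theta)
                t[i] += si * delta
                t[j] += sj * delta  # when i == j the two shifts add or cancel
                vals.append(L(t))
            H[i][j] = (vals[0] - vals[1] - vals[2] + vals[3]) / (4 * delta * delta)
    return H
```

The Hessian estimate divides by delta squared, so it is more sensitive to roundoff than the gradient; if the entries look noisy, a somewhat larger delta is worth trying.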


13.2 Boundary-avoiding priors for modal summaries 
Posterior modes on the boundary of parameter space 


The posterior mode is a good point summary of a symmetric posterior distribution. If the 
posterior is asymmetric, however, the mode can be a poor point estimate. 

Consider, for example, the posterior distribution for the group-level scale parameter in 
the 8-schools example, displayed on page 121 and repeated here as Figure 13.1. The mode 
of the (marginal) posterior distribution is at τ = 0, corresponding to the model in which
the effects of coaching on college admissions tests are the same in all eight schools. This 
conclusion is consistent with the data (as indicated by zero being the posterior mode) but on 
substantive grounds we do not believe the true variation to be exactly zero: the coaching 
programs in the eight schools differed, and so the effects should vary, if only by a small 
amount. 


Figure 13.1 Marginal posterior density, p(τ|y), for the standard deviation of the population of school
effects θ_j in the educational testing example. If we were to choose to summarize this distribution by
its mode, we would be in the uncomfortable position of setting τ = 0, an estimate on the boundary
of parameter space.


[Figure 13.2 panels: (a) "Sampling distribution of τ̂ (true value is τ = 0.5)": histogram of the maximum marginal likelihood estimate, with frequency (out of 1000 simulations) on the vertical axis. (b) "100 draws of the marginal likelihood p(y|τ), each corresponding to a different random draw of y": marginal likelihood, p(y|τ), plotted against the hierarchical scale parameter, τ.]


Figure 13.2 From a simple one-dimensional hierarchical model with scale parameter 0.5 and data
in 10 groups: (a) Sampling distribution of the marginal posterior mode of τ under a uniform prior
distribution, based on 1000 simulations of data from the model. (b) 100 simulations of the marginal
likelihood, p(y|τ). In this example, the point estimate is noisy and the likelihood function is not
very informative about τ.


From a fully Bayesian perspective, the posterior distribution represented in Figure 13.1 
is no problem. The uniform prior distribution on τ allows the possibility that this parameter
can be arbitrarily small, but we are assigning a zero probability on the event that τ = 0
exactly. The resulting posterior distribution is defined on a continuous space and we can 
summarize it with random simulations or, if we want a point summary, by the posterior 
median (which in this example takes on the reasonable value of 4.9). 

The problem of zero mode of the marginal likelihood does not only arise in the 8-schools 
example. We illustrate with a stripped-down example with J = 10 groups: 


$$y_j \sim \mathrm{N}(\theta_j, 1), \quad \text{for } j = 1, \ldots, J,$$

for simplicity modeling the θ_j’s with a normal distribution centered at zero:

$$\theta_j \sim \mathrm{N}(0, \tau^2).$$


In our simulation, we assume τ = 0.5.
Figure 13.3 Various possible zero-avoiding prior densities for τ, the group-level standard deviation
parameter in the 8 schools example. We prefer the gamma with 2 degrees of freedom, which hits zero
at τ = 0 (thus ensuring a nonzero posterior mode) but clears zero for any positive τ. In contrast,
the lognormal and inverse-gamma priors effectively shut off τ in some positive region near zero, or
rule out high values of τ. These are behaviors we do not want in a default prior distribution.

All these priors are intended for use in constructing penalized likelihood (posterior mode) estimates;
if we were doing full Bayes and averaging over the posterior distribution of τ, we would be happy
with a uniform or half-Cauchy prior density, as discussed in Section 5.7.


From this model, we create 1000 simulated datasets y; for each we determine the 
marginal likelihood and the value at which it takes on its maximum. 

Figure 13.2a shows the sampling distribution of the maximum marginal likelihood estimate
of τ (in this simple example, we just solve for τ in the equation 1 + τ² = (1/J) Σ_{j=1}^J y_j²,
with the boundary constraint that τ = 0 if (1/J) Σ_{j=1}^J y_j² < 1). In almost half the simulations,
the marginal likelihood is maximized at τ = 0. There is enough noise here that it is almost
impossible to do anything more than bound the group-level variance; the data do not allow
an accurate estimate.
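
The simulation is short enough to sketch in Python. The sketch relies on the fact that, after averaging over θ_j, each y_j is marginally N(0, 1 + τ²); the function names, the seed, and the use of Python's random module are choices made here for illustration and are not taken from the text.

```python
import random

def max_marginal_likelihood_tau(y):
    """Solve 1 + tau^2 = mean(y_j^2), setting tau = 0 when mean(y_j^2) < 1."""
    mean_sq = sum(v * v for v in y) / len(y)
    return max(mean_sq - 1.0, 0.0) ** 0.5

def simulate(n_sims=1000, J=10, tau=0.5, seed=0):
    """Estimates of tau from n_sims simulated datasets of J groups each."""
    rng = random.Random(seed)
    marginal_sd = (1.0 + tau * tau) ** 0.5  # y_j ~ N(0, 1 + tau^2) marginally
    return [
        max_marginal_likelihood_tau([rng.gauss(0.0, marginal_sd) for _ in range(J)])
        for _ in range(n_sims)
    ]
```

Counting the fraction of zeros in simulate() gives the height of the spike at the boundary in the sampling distribution, and a histogram of the same list corresponds to the left panel of Figure 13.2.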


Zero-avoiding prior distribution for a group-level variance parameter 


The problems in the above examples arise because the mode is taken as a posterior sum- 
mary. If we are planning to summarize the posterior distribution by its mode (perhaps for 
computational convenience or as a quick approximation, as discussed in this chapter), it 
can make sense to choose the prior distribution accordingly. 

What is an appropriate noninformative prior distribution, p(τ), that will avoid boundary
estimates in an 8-schools-like problem in which inferences are summarized by the posterior
mode? To start with, p(τ) must be zero at τ = 0.

Several convenient probability models on positive random variables have this property,
including the lognormal (log τ ~ N(μ_τ, σ_τ²)) which is convenient given our familiarity with
the normal distribution, and the inverse-gamma (τ² ~ Inv-gamma(α_τ, β_τ)) which is conditionally
conjugate in the hierarchical models we have been considering. Unfortunately,
both these classes of prior distribution cut off too sharply near zero. The lognormal and the
inverse-gamma both have effective lower bounds, below which the prior density declines so
rapidly as to effectively shut off some range of τ near zero. If the scale parameters on these
models are set to be vague enough, this lower bound can be made extremely low, but then
the prior becomes strongly peaked. There is thus no reasonable default setting with these
models; the user must either choose a vague prior which rules out values of τ near zero, or
a distribution that is highly informative on the scale of the data.

Instead we prefer a prior model such as τ ~ Gamma(2, 1/A), a gamma distribution with
shape 2 and some large scale parameter A. This density starts at 0 when τ = 0 and then
increases linearly from there, eventually curving gently back to zero for large values of τ.
The linear behavior at τ = 0 ensures that no matter how concentrated the likelihood is near
zero, the posterior distribution will remain consistent with the data, a property that does
not hold with the lognormal or inverse-gamma prior distributions.
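
To see the effect numerically, the Python sketch below finds the mode of the marginal likelihood with and without the gamma prior for the simple J-group model above, using a crude grid search. The scale A = 10, the grid limits, and the function names are all assumptions made for illustration; the prior is taken to have density proportional to τ exp(−τ/A).

```python
import math

def log_marginal_lik(tau, y):
    """Log marginal likelihood of tau when y_j ~ N(0, 1 + tau^2), up to a constant."""
    v = 1.0 + tau * tau
    return -0.5 * len(y) * math.log(v) - 0.5 * sum(x * x for x in y) / v

def log_gamma2_prior(tau, A=10.0):
    """Log density of a gamma prior with shape 2 and scale A, up to a constant."""
    return math.log(tau) - tau / A

def grid_mode(log_density, lo=1e-4, hi=5.0, n=5000):
    """Crude mode finder: the grid point with the largest log density."""
    grid = [lo + (hi - lo) * k / (n - 1) for k in range(n)]
    return max(grid, key=log_density)
```

For a dataset whose mean of squares is below 1, grid_mode applied to the marginal likelihood alone returns the lower edge of the grid, while adding log_gamma2_prior moves the mode to a strictly positive value.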

Again, the purpose of this prior distribution is to give a good estimate when the posterior
distribution for τ is to be summarized by its mode, as is often the case in statistical
computations with hierarchical models. If we were planning to use posterior simulations, we 
would generally not see any advantage to the gamma prior distribution and instead would 
go with the uniform or half-Cauchy as default choices, as discussed in Chapter 5. 


Boundary-avoiding prior distribution for a correlation parameter 


We next illustrate the difficulty of estimating correlation parameters, using a simple model 
of a varying-intercept, varying-slope regression with a 2 x 2 group-level variance matrix. 
Within each group j = 1,..., J, we assume a linear model: 


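$$y_{ij} \sim \mathrm{N}(\theta_{j1} + \theta_{j2} x_i, 1), \quad \text{for } i = 1, \ldots, n_j.$$


For our simulation we draw the x_i’s independently from a unit normal distribution and set
n_j = 5 for all j. As before, we consider J = 10 groups.
The two regression parameters in each group j are modeled as a random draw from a
normal distribution:

$$\begin{pmatrix} \theta_{j1} \\ \theta_{j2} \end{pmatrix} \sim \mathrm{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \tau_1^2 & \rho\tau_1\tau_2 \\ \rho\tau_1\tau_2 & \tau_2^2 \end{pmatrix} \right).$$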


As in the previous example, we average over the linear parameters θ and work with the
marginal likelihood, which can be computed analytically as

$$p(y|\tau_1, \tau_2, \rho) = \prod_{j=1}^{J} \mathrm{N}(\hat\theta_j | 0, V_j + T),$$

where θ̂_j and V_j are the least-squares estimate and corresponding covariance matrix from
regressing y on x for the data in group j, and

$$T = \begin{pmatrix} \tau_1^2 & \rho\tau_1\tau_2 \\ \rho\tau_1\tau_2 & \tau_2^2 \end{pmatrix}.$$

For this example, we assume the true values of the variance parameters are τ_1 = τ_2 = 0.5
and ρ = 0. For the goal of getting an estimate of ρ that is stable and far from the boundary,
setting the true value to 0 should be a best-case scenario. Even here, though, it turns out
we have troubles.

As before, we simulate data and compute the marginal likelihood 1000 times. For this
example we are focusing on ρ so we look at the value of ρ in the maximum marginal
likelihood estimate of (τ_1, τ_2, ρ) and we also look at the profile likelihood for ρ; that is,
the function L_profile(ρ|y) = max_{τ_1,τ_2} p(y|τ_1, τ_2, ρ). As is standard in regression models,
all these definitions are implicitly conditional on x, a point we discuss further in Section
14.1. For each simulation, we compute the profile likelihood as a function of ρ using a
numerical optimization routine applied separately to each ρ in a grid. The optimization is
Figure 13.4 From a simulated varying-intercept, varying-slope hierarchical regression with identity
group-level covariance matrix: (a) Sampling distribution of the maximum marginal likelihood estimate
ρ̂ of the group-level correlation parameter, based on 1000 simulations of data from the model.
(b) 100 simulations of the marginal profile likelihood, L_profile(ρ|y) = max_{τ_1,τ_2} p(y|τ_1, τ_2, ρ). In this
example, the maximum marginal likelihood estimate is extremely variable and the likelihood function
is not very informative about ρ. (In some cases, the profile likelihood for ρ is flat in some places;
this occurs when the corresponding estimate of one of the variance parameters (τ_1 or τ_2) is zero,
in which case ρ is not identified.)


easy enough, because the marginal likelihood function can be written in closed form. The
marginal posterior density for ρ, averaging over a uniform prior on (τ_1, τ_2, ρ), would take
more effort to work out but would yield similar results.

Figure 13.4 displays the results. In 1000 simulations, the maximum marginal likelihood
estimate of group-level correlation is on the boundary (ρ̂ = ±1) over 10% of the time, and
the profile marginal likelihood for ρ is typically not very informative. In a fully Bayesian
setting, we would average over ρ; in a penalized likelihood framework, we want a more
stable point estimate.

If the plan is to summarize inference by the posterior mode of ρ, we would replace the
U(−1, 1) prior distribution with p(ρ) ∝ (1 − ρ)(1 + ρ), which is equivalent to a Beta(2, 2)
on the transformed parameter (ρ + 1)/2. The prior and resulting posterior densities are zero at
the boundaries and thus the posterior mode will never be −1 or 1. However, as with the
Gamma(2, 1/A) prior distribution for τ discussed earlier in this section, the prior density for
ρ is linear near the boundaries and thus will not contradict any likelihood.
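
The equivalence with the Beta(2, 2) distribution follows from a one-line change of variables, written out here for convenience; the symbol u for the transformed parameter is introduced only for this derivation.

```latex
\[
u = \frac{\rho + 1}{2}, \qquad \rho = 2u - 1, \qquad \frac{d\rho}{du} = 2,
\]
\[
p(u) \;\propto\; (1 - \rho)(1 + \rho)\,\left|\frac{d\rho}{du}\right|
     \;=\; (2 - 2u)(2u)\cdot 2
     \;\propto\; u^{2-1}(1 - u)^{2-1}, \qquad 0 < u < 1,
\]
```

which is the kernel of a Beta(2, 2) density. The Jacobian is constant, so it affects only the normalizing constant.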


Degeneracy-avoiding prior distribution for a covariance matrix 


More generally, we want any mode-based point estimate or computational approximation of 
a covariance matrix to be non-degenerate, that is, to have positive variance parameters and 
a positive-definite correlation matrix. Again, we can ensure this property in the posterior 
mode by choosing a prior density that goes to zero when the covariance matrix is degenerate. 
By analogy to the one- and two-dimensional solutions above, for a general d × d covariance
matrix we choose the Wishart(d + 3, AI) prior density, which is zero but with a positive 
constant derivative at the boundary. As before, we can set A to a large value based on the 
context of the problem. The resulting estimate of the covariance matrix is always positive 
definite but without excluding estimates near the boundary if they are supported by the 
likelihood. 

In the limit of large values of A, the Wishart(d + 3, AI) prior distribution on a covariance
matrix corresponds to independent Gamma(2, δ) prior distributions on each of the d
eigenvalues with δ → 0 and thus can be seen as a generalization of our default model for
variance parameters given above. In two dimensions, the multivariate model in the limit
A → ∞ corresponds to the prior distribution p(ρ) ∝ (1 + ρ)(1 − ρ) as before.

Again, we see this family of default Wishart prior distributions as a noninformative 
choice if the plan is to summarize or approximate inference for the covariance matrix by 
the posterior mode. For full Bayesian inference, there would be no need to choose a prior 
distribution that hits zero at the boundary; we would prefer something like a scaled inverse- 
Wishart model that generalizes the half-Cauchy prior distribution discussed in Section 5.7. 


13.3 Normal and related mixture approximations 
Fitting multivariate normal densities based on the curvature at the modes 


Once the mode or modes have been found (perhaps after including a boundary-avoiding 
prior distribution as discussed in the previous section), we can construct an approximation 
based on the (multivariate) normal distribution. For simplicity we first consider the case of
a single mode at $\hat\theta$, where we fit a normal distribution to the first two derivatives of the log
posterior density function at $\hat\theta$:

$$p_{\text{normal approx}}(\theta) = \mathrm{N}(\theta \mid \hat\theta, V_\theta).$$

The variance matrix is the inverse of the curvature of the log posterior density at the mode,

$$V_\theta = \left[ -\frac{d^2 \log p(\theta|y)}{d\theta^2} \bigg|_{\theta=\hat\theta} \right]^{-1},$$

and this second derivative can be calculated analytically for some problems or else
approximated numerically as in (13.2). As usual, before fitting a normal density, it makes
sense to transform parameters as appropriate, often using logarithms and logits, so that they
are defined on the whole real line with roughly symmetric distributions (remembering to
multiply the posterior density by the Jacobian of the transformation, as in Section 1.8).
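
The construction above can be sketched numerically. The following is a minimal illustration, not a method from the text: it assumes a hypothetical data vector `y`, a normal model with a uniform prior on $(\mu, \log\sigma)$, and helper names (`numerical_gradient`, `numerical_hessian`, `normal_approximation`) that we introduce here. The mode is found by Newton's method with finite-difference derivatives, and $V_\theta$ is the inverse of minus the Hessian at the mode.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Central-difference gradient of a scalar function f at the point x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

def numerical_hessian(f, x, h=1e-4):
    """Central-difference matrix of second derivatives of f at x."""
    d = x.size
    hess = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = h
            ej = np.zeros(d); ej[j] = h
            hess[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                          - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return hess

def normal_approximation(log_post, theta_start, max_iter=50, tol=1e-8):
    """Newton search for the mode, then V = inverse of minus the Hessian there."""
    theta = np.asarray(theta_start, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(numerical_hessian(log_post, theta),
                               numerical_gradient(log_post, theta))
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    V = np.linalg.inv(-numerical_hessian(log_post, theta))
    return theta, V

# Hypothetical data; the parameters are theta = (mu, log sigma).
y = np.array([10.2, 9.7, 10.5, 9.9, 10.1, 10.4, 9.6, 10.0])

def log_post(theta):
    mu, log_sigma = theta
    # Uniform prior on (mu, log sigma), so the log posterior is the log likelihood.
    return -y.size * log_sigma - 0.5 * np.sum((y - mu) ** 2) * np.exp(-2 * log_sigma)

theta_hat, V_theta = normal_approximation(log_post, [10.0, -1.0])
print("mode:", theta_hat)
print("variance matrix:\n", V_theta)
```

For this toy model the mode and curvature are available in closed form (the mode of $\mu$ is the sample mean, and the approximate variance of $\log\sigma$ is $1/(2n)$), which gives a simple check on the numerical derivatives. Newton's method is used only for brevity and needs a starting point reasonably close to the mode.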


Laplace’s method for analytic approximation of integrals 


Instead of approximating just the posterior with a normal distribution, we can use Laplace’s
method to approximate integrals of a smooth function times the posterior, $h(\theta)p(\theta|y)$. The
approximation is proportional to a (multivariate) normal density in $\theta$, and its integral is
just

approximation of $\mathrm{E}(h(\theta)|y)$: $\; h(\theta_0)\,p(\theta_0|y)\,(2\pi)^{d/2}\,|-u''(\theta_0)|^{-1/2}$,

where $d$ is the dimension of $\theta$, $u(\theta) = \log(h(\theta)p(\theta|y))$, and $\theta_0$ is the point at which $u(\theta)$ is
maximized.

If $h(\theta)$ is a fairly smooth function, this approximation can be reasonable, due to the
approximate normality of the posterior distribution, $p(\theta|y)$, for large sample sizes (recall
Chapter 4). Because Laplace’s method is based on normality, it is most effective for unimodal
posterior densities, or when applied separately to each mode of a multimodal density.
We use Laplace’s method to compute the relative masses of the densities in a normal-mixture
approximation to a multimodal density (13.4).


Laplace’s method using unnormalized densities. If we are only able to compute the unnormalized
density $q(\theta|y)$, we can apply Laplace’s method separately to $hq$ and $q$ to evaluate
the numerator and denominator here:

$$\mathrm{E}(h(\theta)|y) = \frac{\int h(\theta)\,q(\theta|y)\,d\theta}{\int q(\theta|y)\,d\theta}. \tag{13.3}$$
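
As a rough illustration of (13.3), the sketch below applies Laplace’s method separately to $hq$ and $q$ in one dimension. It is an example under stated assumptions rather than anything from the text: the unnormalized density is taken to be a Gamma($a$, $b$) kernel with hypothetical values $a = 20$, $b = 4$, the function of interest is $h(\theta) = \theta$, and `laplace_log_integral` is a name we introduce here. The exact posterior mean $a/b$ is available for comparison.

```python
import math

def laplace_log_integral(log_f, x_start, h=1e-4, max_iter=100, tol=1e-8):
    """Log of the Laplace approximation to the integral of exp(log_f(x)), 1-D.

    Finds the maximizer x0 of u = log_f by Newton's method with
    finite-difference derivatives, then returns
    u(x0) + (1/2) log(2 pi) - (1/2) log(-u''(x0)).
    """
    def d1(x):
        return (log_f(x + h) - log_f(x - h)) / (2 * h)

    def d2(x):
        return (log_f(x + h) - 2 * log_f(x) + log_f(x - h)) / (h * h)

    x = float(x_start)
    for _ in range(max_iter):
        step = d1(x) / d2(x)
        x -= step
        if abs(step) < tol:
            break
    return log_f(x) + 0.5 * math.log(2 * math.pi) - 0.5 * math.log(-d2(x))

# Hypothetical unnormalized posterior: Gamma(a, b) kernel, theta > 0.
a, b = 20.0, 4.0

def log_q(theta):
    return (a - 1) * math.log(theta) - b * theta

def log_hq(theta):
    # h(theta) = theta, so log(h q) = log(theta) + log q(theta)
    return math.log(theta) + log_q(theta)

laplace_mean = math.exp(laplace_log_integral(log_hq, 4.0)
                        - laplace_log_integral(log_q, 4.0))
exact_mean = a / b
print(laplace_mean, exact_mean)
```

Working on the log scale avoids overflow when the unnormalized density takes very large or very small values. Note that the numerator and denominator are maximized at different points, so two separate mode searches are needed.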


Mixture approximation for multimodal densities


Now suppose we have found $K$ modes in the posterior density. The posterior distribution
can then be approximated by a mixture of $K$ multivariate normal distributions, each with
its own mode $\hat\theta_k$, variance matrix $V_{\theta k}$, and relative mass $\omega_k$. That is, the target density
$p(\theta|y)$ can be approximated by

$$p_{\text{normal approx}}(\theta) \propto \sum_{k=1}^{K} \omega_k\, \mathrm{N}(\theta \mid \hat\theta_k, V_{\theta k}).$$

For each $k$, the mass $\omega_k$ of the $k$th component of the multivariate normal mixture can be
estimated by equating the posterior density, $p(\hat\theta_k|y)$, or the unnormalized posterior density,
$q(\hat\theta_k|y)$, to the approximation, $p_{\text{normal approx}}(\hat\theta_k)$, at each of the $K$ modes. If the modes are
fairly widely separated and the normal approximation is appropriate for each mode, then
we obtain

$$\omega_k = q(\hat\theta_k|y)\,|V_{\theta k}|^{1/2}, \tag{13.4}$$

which yields the normal-mixture approximation

$$p_{\text{normal approx}}(\theta) \propto \sum_{k=1}^{K} q(\hat\theta_k|y) \exp\left(-\frac{1}{2}(\theta - \hat\theta_k)^T V_{\theta k}^{-1} (\theta - \hat\theta_k)\right).$$


Multivariate t approximation instead of the normal 


For a broader distribution, one can replace each normal density by a multivariate $t$ with
some small number of degrees of freedom, $\nu$. The corresponding approximation is a mixture
density function that has the functional form,

$$p_{t\text{ approx}}(\theta) \propto \sum_{k=1}^{K} q(\hat\theta_k|y) \left(\nu + (\theta - \hat\theta_k)^T V_{\theta k}^{-1} (\theta - \hat\theta_k)\right)^{-(d+\nu)/2},$$

where $d$ is the dimension of $\theta$. A value such as $\nu = 4$, which provides three finite moments
for the approximating density, has worked in some examples.

Several strategies can be used to improve the approximate distribution further, includ- 
ing analytically fitting the approximation to locations other than the modes, such as sad- 
dle points or tails, of the distribution; analytically or numerically integrating out some 
components; or moving to an iterative scheme such as variational Bayes or expectation 
propagation, as described in Sections 13.7 and 13.8. 


Sampling from the approximate posterior distributions 


It is easy to draw random samples from the multivariate normal or $t$-mixture approximations.
To generate a single sample from the approximation, first choose one of the $K$
mixture components using the relative probability masses of the mixture components, $\omega_k$,
as multinomial probabilities, and then draw from the chosen component. Appendix A provides
details on how to sample from a single multivariate normal or $t$ distribution using the
Cholesky factorization of the scale matrix.

The sample drawn from the approximate posterior distribution can be used in impor- 
tance sampling, or an improved sample can be obtained using importance resampling, as 
described in Section 10.4. 
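
The weights in (13.4) and the two-stage sampling scheme just described can be sketched as follows. This is an illustrative sketch only: the two modes, their variance matrices, and the unnormalized log densities at the modes are hypothetical numbers, and the function names are ours. Only the normal mixture is shown; a $t$ mixture would additionally divide each normal deviate by the square root of an independent scaled $\chi^2$ draw.

```python
import numpy as np

def mixture_weights(log_q_at_modes, covariances):
    """omega_k proportional to q(theta_k|y) |V_k|^(1/2), normalized to sum to 1."""
    log_w = np.array([lq + 0.5 * np.linalg.slogdet(V)[1]
                      for lq, V in zip(log_q_at_modes, covariances)])
    w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    return w / w.sum()

def sample_normal_mixture(modes, covariances, weights, n_draws, rng):
    """Pick a component with probability omega_k, then draw from N(mode_k, V_k)."""
    modes = np.asarray(modes, dtype=float)
    d = modes.shape[1]
    chols = [np.linalg.cholesky(V) for V in covariances]  # V = L L^T
    labels = rng.choice(len(weights), size=n_draws, p=weights)
    z = rng.standard_normal((n_draws, d))
    draws = np.empty((n_draws, d))
    for i, k in enumerate(labels):
        draws[i] = modes[k] + chols[k] @ z[i]
    return draws, labels

# Hypothetical two-mode example in two dimensions.
modes = np.array([[-2.0, 0.0], [3.0, 1.0]])
covariances = [np.eye(2), np.array([[0.5, 0.2], [0.2, 0.3]])]
log_q_at_modes = [np.log(1.0), np.log(2.0)]  # unnormalized log densities at the modes

weights = mixture_weights(log_q_at_modes, covariances)
rng = np.random.default_rng(0)
draws, labels = sample_normal_mixture(modes, covariances, weights, 20000, rng)
print(weights)
```

The second mode has the higher density but a much smaller determinant, so it receives the smaller mass; this is the sense in which (13.4) compares masses rather than heights.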


13.4 Finding marginal posterior modes using EM


In problems with many parameters, normal approximations to the joint distribution are 
often useless, and the joint mode is typically not helpful. It is often useful, however, to 
base an approximation on a marginal posterior mode of a subset of the parameters; we
use the notation $\theta = (\gamma, \phi)$ and suppose we are interested in first approximating $p(\phi|y)$.
After approximating $p(\phi|y)$ as a normal or $t$ or a mixture of these, we may be able to
approximate the conditional distribution, $p(\gamma|\phi, y)$, as normal (or $t$, or a mixture) with
parameters depending on $\phi$. In this section we address the first problem, and in the next
section we address the second.

The EM algorithm can be viewed as an iterative method for finding the mode of the
marginal posterior density, $p(\phi|y)$, and is extremely useful for many common models for
which it is hard to maximize $p(\phi|y)$ directly but easy to work with $p(\gamma|\phi, y)$ and $p(\phi|\gamma, y)$.
Examples of the EM algorithm appear in the later chapters of this book, including Sections 
18.4, 18.6, and 22.2; we introduce the method here. 

If we think of $\phi$ as the parameters in our problem and $\gamma$ as missing data, the EM
algorithm formalizes a relatively old idea for handling missing data: start with a guess of 
the parameters and then (1) replace missing values by their expectations given the guessed 
parameters, (2) estimate parameters assuming the missing data are equal to their estimated 
values, (3) re-estimate the missing values assuming the new parameter estimates are correct, 
(4) re-estimate parameters, and so forth, iterating until convergence. In fact, the EM 
algorithm is more efficient than these four steps would suggest since each missing value is 
not estimated separately; instead, those functions of the missing data that are needed to 
estimate the model parameters are estimated jointly. 

The name ‘EM’ comes from the two alternating steps: finding the expectation of the 
needed functions (the sufficient statistics) of the missing values, and maximizing the result- 
ing posterior density to estimate the parameters as if these functions of the missing data 
were observed. For many standard models, both steps—estimating the missing values given 
a current estimate of the parameter and estimating the parameters given current estimates 
of the missing values—are straightforward. EM is widely applicable because many models, 
including mixture models and some hierarchical models, can be re-expressed as distributions
on augmented parameter spaces, where the added parameters $\gamma$ can be thought of as
missing data.


Derivation of the EM and generalized EM algorithms 


In the notation of this chapter, EM finds the modes of the marginal posterior distribution,
$p(\phi|y)$, averaging over the parameters $\gamma$. A more conventional presentation, in terms of
missing and complete data, appears in Section 18.2. We show here that each iteration of
the EM algorithm increases the value of the log posterior density until convergence. We
start with the simple identity

$$\log p(\phi|y) = \log p(\gamma, \phi|y) - \log p(\gamma|\phi, y)$$

and take expectations of both sides, treating $\gamma$ as a random variable with the distribution
$p(\gamma|\phi^{\text{old}}, y)$, where $\phi^{\text{old}}$ is the current (old) guess. The left side of the above equation does
not depend on $\gamma$, so averaging over $\gamma$ yields

$$\log p(\phi|y) = \mathrm{E}_{\text{old}}(\log p(\gamma, \phi|y)) - \mathrm{E}_{\text{old}}(\log p(\gamma|\phi, y)), \tag{13.5}$$

where $\mathrm{E}_{\text{old}}$ is an average over $\gamma$ under the distribution $p(\gamma|\phi^{\text{old}}, y)$. The last term on the
right side of (13.5), $\mathrm{E}_{\text{old}}(\log p(\gamma|\phi, y))$, is maximized at $\phi = \phi^{\text{old}}$ (see Exercise 13.6). The
other term, the expected log joint posterior density, $\mathrm{E}_{\text{old}}(\log p(\gamma, \phi|y))$, is repeatedly used
in computation,

$$\mathrm{E}_{\text{old}}(\log p(\gamma, \phi|y)) = \int (\log p(\gamma, \phi|y))\, p(\gamma|\phi^{\text{old}}, y)\, d\gamma.$$

This expression is called $Q(\phi|\phi^{\text{old}})$ in the EM literature, where it is viewed as the expected
complete-data log-likelihood.

Now consider any value $\phi^{\text{new}}$ for which $\mathrm{E}_{\text{old}}(\log p(\gamma, \phi^{\text{new}}|y)) > \mathrm{E}_{\text{old}}(\log p(\gamma, \phi^{\text{old}}|y))$. If
we replace $\phi^{\text{old}}$ by $\phi^{\text{new}}$, we increase the first term on the right side of (13.5), while not
increasing the second term, and so the total must increase: $\log p(\phi^{\text{new}}|y) > \log p(\phi^{\text{old}}|y)$.
This idea motivates the generalized EM (GEM) algorithm: at each iteration, determine
$\mathrm{E}_{\text{old}}(\log p(\gamma, \phi|y))$ (considered as a function of $\phi$) and update $\phi$ to a new value that
increases this function. The EM algorithm is the special case in which the new value of $\phi$ is
chosen to maximize $\mathrm{E}_{\text{old}}(\log p(\gamma, \phi|y))$, rather than merely increase it. The EM and GEM
algorithms both have the property of increasing the marginal posterior density, $p(\phi|y)$, at
each iteration.

Because the marginal posterior density, $p(\phi|y)$, increases in each step of the EM algorithm,
and because the $Q$ function is maximized at each step, EM converges to a local mode
of the posterior density except in some special cases. (Because the GEM algorithm does
not maximize at each step, it does not necessarily converge to a local mode.) The rate at
which the EM algorithm converges to a local mode depends on the proportion of ‘information’
about $\phi$ in the joint density, $p(\gamma, \phi|y)$, that is missing from the marginal density,
$p(\phi|y)$. It can be slow to converge if the proportion of missing information is large; see the
bibliographic note at the end of this chapter for many theoretical and applied references on
this topic.


Implementation of the EM algorithm 


The EM algorithm can be described algorithmically as follows.
1. Start with a crude parameter estimate, $\phi^0$.
2. For $t = 1, 2, \ldots$:
(a) E-step: Determine the expected log posterior density function,

$$\mathrm{E}_{\text{old}}(\log p(\gamma, \phi|y)) = \int p(\gamma|\phi^{\text{old}}, y) \log p(\gamma, \phi|y)\, d\gamma,$$

where the expectation averages over the conditional posterior distribution of $\gamma$, given
the current estimate, $\phi^{\text{old}} = \phi^{t-1}$.

(b) M-step: Let $\phi^t$ be the value of $\phi$ that maximizes $\mathrm{E}_{\text{old}}(\log p(\gamma, \phi|y))$. For the GEM
algorithm, it is only required that $\mathrm{E}_{\text{old}}(\log p(\gamma, \phi|y))$ be increased, not necessarily
maximized.


As we have seen, the marginal posterior density, $p(\phi|y)$, increases at each step of the EM
algorithm, so that, except in some special cases, the algorithm converges to a local mode of
the posterior density.


Finding multiple modes. A simple way to search for multiple modes with EM is to start 
the iterations at many points throughout the parameter space. If we find several modes, 
we can roughly compare their relative masses using a normal approximation, as described 
in the previous section. 

Debugging. A useful debugging check when running an EM algorithm is to compute the
logarithm of the marginal posterior density, $\log p(\phi^t|y)$, at each iteration, and check that it
increases monotonically. This computation is recommended for all problems for which it is
relatively easy to compute the marginal posterior density.

Example. Normal distribution with unknown mean and variance and partially
conjugate prior distribution

Suppose we weigh an object on a scale $n$ times, and the weighings, $y_1, \ldots, y_n$, are
assumed independent with a $\mathrm{N}(\mu, \sigma^2)$ distribution, where $\mu$ is the true weight of the
object. For simplicity, we assume a $\mathrm{N}(\mu_0, \tau_0^2)$ prior distribution on $\mu$ (with $\mu_0$ and
$\tau_0$ known) and the standard noninformative uniform prior distribution on $\log\sigma$; these
form a partially conjugate joint prior distribution.

Because the model is not fully conjugate, there is no standard form for the joint posterior
distribution of $(\mu, \sigma)$ and no closed-form expression for the marginal posterior
density of $\mu$. We can, however, use the EM algorithm to find the marginal posterior
mode of $\mu$, averaging over $\sigma$; that is, $(\mu, \sigma)$ corresponds to $(\phi, \gamma)$ in the general
notation.


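Joint log posterior density. The logarithm of the joint posterior density is

$$\log p(\mu, \sigma|y) = -\frac{1}{2\tau_0^2}(\mu - \mu_0)^2 - (n+1)\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2 + \text{constant}, \tag{13.6}$$

ignoring terms that do not depend on $\mu$ or $\sigma^2$.


E-step. For the E-step of the EM algorithm, we must determine the expectation of
(13.6), averaging over $\sigma$ and conditional on the current guess, $\mu^{\text{old}}$, and $y$:

$$\mathrm{E}_{\text{old}} \log p(\mu, \sigma|y) = -\frac{1}{2\tau_0^2}(\mu - \mu_0)^2 - (n+1)\,\mathrm{E}_{\text{old}}(\log\sigma) - \frac{1}{2}\,\mathrm{E}_{\text{old}}\!\left(\frac{1}{\sigma^2}\right)\sum_{i=1}^{n}(y_i - \mu)^2 + \text{constant}. \tag{13.7}$$

We must now evaluate $\mathrm{E}_{\text{old}}(\log\sigma)$ and $\mathrm{E}_{\text{old}}(1/\sigma^2)$. Actually, we need evaluate only the
latter expression, because the former expression is not linked to $\mu$ in (13.7) and thus
will not affect the M-step. The expression $\mathrm{E}_{\text{old}}(1/\sigma^2)$ can be evaluated by noting that,
given $\mu$, the posterior distribution of $\sigma^2$ is that for a normal distribution with known
mean and unknown variance, which is scaled inverse-$\chi^2$:

$$\sigma^2 \mid \mu, y \sim \text{Inv-}\chi^2\!\left(n,\ \frac{1}{n}\sum_{i=1}^{n}(y_i - \mu)^2\right).$$

Then the conditional posterior distribution of $1/\sigma^2$ is a scaled $\chi^2$, and

$$\mathrm{E}_{\text{old}}\!\left(\frac{1}{\sigma^2}\right) = \mathrm{E}\!\left(\frac{1}{\sigma^2}\,\Big|\,\mu^{\text{old}}, y\right) = \left(\frac{1}{n}\sum_{i=1}^{n}(y_i - \mu^{\text{old}})^2\right)^{-1}.$$

We can then re-express (13.7) as

$$\mathrm{E}_{\text{old}} \log p(\mu, \sigma|y) = -\frac{1}{2\tau_0^2}(\mu - \mu_0)^2 - \frac{1}{2}\left(\frac{1}{n}\sum_{i=1}^{n}(y_i - \mu^{\text{old}})^2\right)^{-1}\sum_{i=1}^{n}(y_i - \mu)^2 + \text{const}. \tag{13.8}$$

M-step. For the M-step, we must find the $\mu$ that maximizes the above expression.
For this problem, the task is straightforward, because (13.8) has the form of a normal
log posterior density, with prior distribution $\mu \sim \mathrm{N}(\mu_0, \tau_0^2)$ and $n$ data points $y_i$,
each with variance $\frac{1}{n}\sum_{i=1}^{n}(y_i - \mu^{\text{old}})^2$. The M-step is achieved by the mode of the
equivalent posterior density, which is

$$\mu^{\text{new}} = \frac{\dfrac{1}{\tau_0^2}\mu_0 + \dfrac{n}{\frac{1}{n}\sum_{i=1}^{n}(y_i - \mu^{\text{old}})^2}\,\bar y}{\dfrac{1}{\tau_0^2} + \dfrac{n}{\frac{1}{n}\sum_{i=1}^{n}(y_i - \mu^{\text{old}})^2}}.$$

If we iterate this computation, $\mu$ converges to the marginal mode of $p(\mu|y)$.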
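
The E- and M-steps of this example are short enough to sketch directly, together with the debugging check recommended earlier. The code below is a minimal sketch under stated assumptions: the weighings `y` and the prior values `mu0`, `tau0` are hypothetical, and the function names are ours. For the monotonicity check we use the unnormalized marginal density obtained by integrating $\sigma$ out of (13.6), $\log p(\mu|y) = -(\mu - \mu_0)^2/(2\tau_0^2) - (n/2)\log\sum_i (y_i - \mu)^2 + \text{constant}$; this expression is our own working and should be verified before being relied upon.

```python
import numpy as np

def em_marginal_mode(y, mu0, tau0, mu_start, max_iter=200, tol=1e-12):
    """EM iterations for the marginal posterior mode of mu, averaging over sigma."""
    y = np.asarray(y, dtype=float)
    n = y.size
    ybar = y.mean()
    mu = float(mu_start)
    path = [mu]
    for _ in range(max_iter):
        # E-step: E_old(1/sigma^2) = 1 / ((1/n) sum_i (y_i - mu_old)^2)
        inv_var = 1.0 / np.mean((y - mu) ** 2)
        # M-step: precision-weighted average of the prior mean and the sample mean
        mu_new = (mu0 / tau0**2 + n * inv_var * ybar) / (1.0 / tau0**2 + n * inv_var)
        path.append(mu_new)
        converged = abs(mu_new - mu) < tol
        mu = mu_new
        if converged:
            break
    return mu, path

def log_marginal(mu, y, mu0, tau0):
    """Unnormalized log p(mu|y), with sigma integrated out analytically."""
    y = np.asarray(y, dtype=float)
    return (-0.5 * (mu - mu0) ** 2 / tau0**2
            - 0.5 * y.size * np.log(np.sum((y - mu) ** 2)))

# Hypothetical weighings and prior settings.
y = np.array([10.2, 9.7, 10.5, 9.9, 10.1, 10.4, 9.6, 10.0])
mu0, tau0 = 9.0, 0.5

mu_hat, path = em_marginal_mode(y, mu0, tau0, mu_start=mu0)
log_post_path = [log_marginal(m, y, mu0, tau0) for m in path]

# Debugging check: the log marginal posterior density should never decrease.
print("mode:", mu_hat)
print("monotone:", bool(np.all(np.diff(log_post_path) >= -1e-12)))
```

Starting the iterations at the prior mean is an arbitrary choice here; in practice one would start from several points to look for multiple modes.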


Extensions of the EM algorithm 


Variants and extensions of the basic EM algorithm increase the range of problems to which 
the algorithm can be applied, and some versions can converge much more quickly than the 
basic EM algorithm. In addition, the EM algorithm and its extensions can be supplemented 
with calculations that obtain the second derivative matrix for use as an estimate of the 
asymptotic variance at the mode. We describe some of these modifications here. 


The ECM algorithm. The ECM algorithm is a variant of the EM algorithm in which
the M-step is replaced by a set of conditional maximizations, or CM-steps. Suppose that
$\phi^t$ is the current iterate. The E-step is unchanged: the expected log posterior density
is computed, averaging over the conditional posterior distribution of $\gamma$ given the current
iterate. The M-step is replaced by a set of $S$ conditional maximizations. At the $s$th
conditional maximization, the value of $\phi^{t+s/S}$ is found that maximizes the expected log
posterior density among all $\phi$ such that $g_s(\phi) = g_s(\phi^{t+(s-1)/S})$ with the $g_s(\cdot)$ known as
constraint functions. The output of the last CM-step, $\phi^{t+S/S} = \phi^{t+1}$, is the next iterate of
the ECM algorithm. The set of constraint functions, $g_s(\cdot)$, $s = 1, \ldots, S$, must satisfy certain
conditions in order to guarantee convergence to a stationary point. The most common choice
of constraint function is the indicator function for the $s$th subset of the parameters. The
parameter vector $\phi$ is partitioned into $S$ disjoint and exhaustive subsets, $(\phi_1, \ldots, \phi_S)$, and
at the $s$th conditional maximization step, all parameters except those in $\phi_s$ are constrained
to equal their current values, $\phi_j^{t+s/S} = \phi_j^{t+(s-1)/S}$ for $j \neq s$. An ECM algorithm based on
a partitioning of the parameters is an example of a generalized EM algorithm. Moreover,
if each of the CM steps maximizes by setting first derivatives equal to zero, then ECM
shares with EM the property that it will converge to a local mode of the marginal posterior
distribution of $\phi$. Because the log posterior density, $\log p(\phi|y)$, increases with every iteration
of the ECM algorithm, its monotone increase can still be used for debugging.

As described in the previous paragraph, ECM performs several CM-steps after each E- 
step. The multicycle ECM algorithm performs additional E-steps during a single iteration. 
For example, one might perform an additional E-step before each conditional maximization. 
Multicycle ECM algorithms require more computation for each iteration than the ECM 
algorithm but can sometimes reach approximate convergence with fewer total iterations. 


The ECME algorithm. The ECME algorithm is an extension of ECM that replaces some
of the conditional maximization steps of the expected log joint density, $\mathrm{E}_{\text{old}}(\log p(\gamma, \phi|y))$,
with conditional maximization steps of the actual log posterior density, $\log p(\phi|y)$. The last
E in the acronym refers to the choice of maximizing either the actual log posterior density
or the expected log posterior density. Iterations of ECME also increase the log posterior
density at each iteration. Moreover, if each conditional maximization sets first derivatives
equal to zero, ECME will converge to a local mode.

ECME can be especially helpful at increasing the rate of convergence, since the actual 
marginal posterior density is being increased on some steps rather than the full posterior 
density averaged over the current estimate of the distribution of the other parameters. The 
increase in speed of convergence can be dramatic when faster converging numerical methods 
(such as Newton’s method) are applied directly to the marginal posterior density on some of 
the CM-steps. For example, if one CM-step requires a one-dimensional search to maximize 
the expected log joint posterior density, then the same effort can be applied directly to the
logarithm of the marginal posterior density of interest. 


The AECM algorithm. The ECME algorithm can be further generalized by allowing different
alternating definitions of $\gamma$ at each conditional maximization step. This generalization
is most straightforward when $\gamma$ represents missing data, and where there are different ways
of completing the data at different maximization steps. In some problems the alternation
can allow much faster convergence. The AECM algorithm shares with EM the property of
converging to a local mode with an increase in the posterior density at each step.


Supplemented EM and ECM algorithms 


The EM algorithm is attractive because it is often easy to implement and has stable and 
reliable convergence properties. The basic algorithm and its extensions can be enhanced 
to produce an estimate of the asymptotic variance matrix at the mode, which is useful in 
forming approximations to the marginal posterior density. The supplemented EM (SEM) 
algorithm and the supplemented ECM (SECM) algorithm use information from the log joint 
posterior density and repeated EM- or ECM-steps to obtain the approximate asymptotic 
variance matrix for the parameters $\phi$.

To describe the SEM algorithm we introduce the notation $M(\phi)$ for the mapping defined
implicitly by the EM algorithm, $\phi^{t+1} = M(\phi^t)$. The asymptotic variance matrix $V$ is

$$V = V_{\text{joint}} + V_{\text{joint}} D_M (I - D_M)^{-1},$$

where $D_M$ is the Jacobian matrix for the EM map evaluated at the marginal mode, $\hat\phi$, and
$V_{\text{joint}}$ is the asymptotic variance matrix based on the logarithm of the joint posterior density
averaged over $\gamma$,

$$V_{\text{joint}} = \left[ \mathrm{E}\!\left( -\frac{d^2 \log p(\phi, \gamma|y)}{d\phi^2} \,\Big|\, \phi, y \right) \bigg|_{\phi=\hat\phi} \right]^{-1}.$$

Typically, $V_{\text{joint}}$ can be computed analytically so that only $D_M$ is required. The matrix $D_M$
is computed numerically at each marginal mode using the E- and M-steps according to the
following algorithm.


1. Run the EM algorithm to convergence, thereby obtaining the marginal mode, $\hat\phi$. (If
multiple runs of EM lead to different modes, apply the following steps separately for
each mode.)

2. Choose a starting value $\phi^0$ for the SEM calculation such that $\phi^0$ does not equal $\hat\phi$ for any
component. One possibility is to use the same starting value that is used for the original
EM calculation.

3. Repeat the following steps to get a sequence of matrices $R^t$, $t = 1, 2, 3, \ldots$, for which each
element $r_{ij}^t$ converges to the appropriate element of $D_M$. In the following we describe
the steps used to generate $R^t$ given the current EM iterate, $\phi^t$.

(a) Run the usual E-step and M-step with input $\phi^t$ to obtain $\phi^{t+1}$.

(b) For each element of $\phi$, say $\phi_i$:

i. Define $\phi^t(i)$ equal to $\hat\phi$ except for the $i$th element, which is replaced by its current
value $\phi_i^t$.

ii. Run one E-step and one M-step treating $\phi^t(i)$ as the input value of the parameter
vector, $\phi$. Denote the result of these E- and M-steps as $\phi^{t+1}(i)$. The $i$th row of $R^t$
is computed as

$$r_{ij}^t = \frac{\phi_j^{t+1}(i) - \hat\phi_j}{\phi_i^t - \hat\phi_i} \quad \text{for each } j.$$

When the value of an element $r_{ij}$ no longer changes, it represents a numerical estimate
of the corresponding element of $D_M$. Once all of the elements in a row have converged,
then we need no longer repeat the final step for that row. If some elements of $\phi$ are
independent of $\gamma$, then EM will converge immediately to the mode for that component with
the corresponding elements of $D_M$ equal to zero. SEM can be easily modified in such cases
to obtain the variance matrix.
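
The SEM steps can be sketched in a few lines once an EM map is available. The code below is a simplified illustration, not the full algorithm: it reuses the normal example of this section with hypothetical data and prior values, treats $\phi = \mu$ as a one-element vector, and stops all rows of $R^t$ together rather than row by row. The names `em_step` and `sem_rate_matrix`, and the stopping thresholds, are our own choices.

```python
import numpy as np

# Hypothetical weighings and prior settings for the normal example.
y = np.array([10.2, 9.7, 10.5, 9.9, 10.1, 10.4, 9.6, 10.0])
n, ybar = y.size, y.mean()
mu0, tau0 = 9.0, 0.5

def em_step(phi):
    """One E-step and one M-step; phi holds mu as a length-1 array."""
    inv_var = 1.0 / np.mean((y - phi[0]) ** 2)
    mu_new = (mu0 / tau0**2 + n * inv_var * ybar) / (1.0 / tau0**2 + n * inv_var)
    return np.array([mu_new])

def sem_rate_matrix(em_step, phi_hat, phi_start, max_iter=100, tol=1e-6, min_gap=1e-7):
    """Numerical estimate of D_M built from the ratios r_ij along the EM iterates."""
    phi_t = np.asarray(phi_start, dtype=float)
    d = phi_hat.size
    R_prev = None
    for _ in range(max_iter):
        if np.any(np.abs(phi_t - phi_hat) < min_gap):
            break  # differences too small for the ratios to be numerically stable
        R = np.empty((d, d))
        for i in range(d):
            phi_i = phi_hat.copy()
            phi_i[i] = phi_t[i]  # mode in every component except the ith
            R[i] = (em_step(phi_i) - phi_hat) / (phi_t[i] - phi_hat[i])
        if R_prev is not None and np.max(np.abs(R - R_prev)) < tol:
            return R
        R_prev = R
        phi_t = em_step(phi_t)  # advance the ordinary EM sequence
    if R_prev is None:
        raise ValueError("starting value is too close to the mode")
    return R_prev

# Step 1: run EM to convergence (a fixed number of steps is ample here).
phi_start = np.array([9.0])
phi_hat = phi_start.copy()
for _ in range(200):
    phi_hat = em_step(phi_hat)

# Steps 2-3: estimate D_M, then combine it with the analytic V_joint.
D_M = sem_rate_matrix(em_step, phi_hat, phi_start)
V_joint = np.array([[1.0 / (1.0 / tau0**2 + n / np.mean((y - phi_hat[0]) ** 2))]])
V = V_joint + V_joint @ D_M @ np.linalg.inv(np.eye(1) - D_M)
print("D_M:", D_M, "V:", V)
```

In this one-parameter case the result can be compared with the inverse curvature of the log marginal posterior density of $\mu$ at the mode, which provides an independent check on the SEM calculation.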

The same approach can be used to supplement the ECM algorithm with an estimate of 
the asymptotic variance matrix. The SECM algorithm is based on the following result: 


$$V = V_{\text{joint}} + V_{\text{joint}}\left(D_M^{\text{ECM}} - D_M^{\text{CM}}\right)\left(I - D_M^{\text{ECM}}\right)^{-1},$$

with $D_M^{\text{ECM}}$ defined and computed in a manner analogous to $D_M$ in the above discussion
except with ECM in place of EM, and where $D_M^{\text{CM}}$ is the rate of convergence for conditional
maximization applied directly to $\log p(\phi|y)$. This latter matrix depends only on $V_{\text{joint}}$ and
$\nabla_s = \nabla g_s(\hat\phi)$, $s = 1, \ldots, S$, the gradient of the vector of constraint functions $g_s$ at $\hat\phi$:

$$D_M^{\text{CM}} = \prod_{s=1}^{S} \left[ \nabla_s \left( \nabla_s^T V_{\text{joint}} \nabla_s \right)^{-1} \nabla_s^T V_{\text{joint}} \right].$$

These gradient vectors are trivial to calculate for a constraint that directly fixes components
of $\phi$. In general, SECM appears to require analytic work to compute $V_{\text{joint}}$ and $\nabla_s$, $s =
1, \ldots, S$, in addition to applying the numerical computation for $D_M^{\text{ECM}}$, but some of these
calculations can be performed using results from the ECM iterations.


Parameter-expanded EM (PX-EM) 


The various methods discussed in Section 12.1 for improving the efficiency of Gibbs samplers 
have analogues for mode-finding (and in fact were originally constructed for that purpose). 
For example, the parameter expansion idea illustrated with the t model on page 295 was 
originally developed in the context of the EM algorithm. In this setting, the individual 
latent data variances $V_i$ are treated as missing data, and the ECM algorithm maximizes
over the parameters $\mu$, $\tau$, and $\alpha$ in the posterior distribution.


13.5 Conditional and marginal posterior approximations 
Approximating the conditional posterior density, $p(\gamma|\phi, y)$


As stated at the beginning of Section 13.4, the normal, multivariate $t$, and other analytically
convenient distributions can be poor approximations to a joint posterior distribution.
Often, however, we can partition the parameter vector as $\theta = (\gamma, \phi)$, in such a way that an
analytic approximation works well for the conditional posterior density, $p(\gamma|\phi, y)$. In general,
the approximation will depend on $\phi$. We write the approximate conditional density
as $p_{\text{approx}}(\gamma|\phi, y)$. For example, in the hierarchical model in Section 5.4, we fitted a normal
distribution to $p(\theta, \mu|\tau, y)$ but not to $p(\tau|y)$ (in that example, the normal conditional
distribution is an exact fit).


Approximating the marginal posterior density, $p(\phi|y)$, using an analytic approximation to
$p(\gamma|\phi, y)$


The mode-finding techniques and normal approximation of Sections 13.1 and 13.3 can be
applied directly to the marginal posterior density if the marginal distribution can be obtained
analytically. If not, then the EM algorithm (Section 13.4) may allow us to find the
mode of the marginal posterior density and construct an approximation. On occasion it is
not possible to construct an approximation to $p(\phi|y)$ using any of those methods. Fortunately
we may still derive an approximation if we have an analytic approximation to the
conditional posterior density, $p(\gamma|\phi, y)$. We can use a trick used in (5.19) in Section 5.4
to generate an approximation to $p(\phi|y)$. The approximation is constructed as the ratio of
the joint posterior distribution to the analytic approximation of the conditional posterior
distribution:

$$p_{\text{approx}}(\phi|y) = \frac{p(\gamma, \phi|y)}{p_{\text{approx}}(\gamma|\phi, y)}. \tag{13.9}$$

The key to this method is that if the denominator has a standard analytic form, we can
compute its normalizing factor, which, in general, depends on $\phi$. When using (13.9), we must
also specify a value $\gamma$ (possibly as a function of $\phi$) since the left side does not involve $\gamma$ at all.
If the analytic approximation to the conditional distribution is exact, the factors of $\gamma$ in the
numerator and denominator cancel, and we obtain the marginal posterior density exactly.
If the analytic approximation is not exact, a natural value to use for $\gamma$ is the center of the
approximate distribution (for example, $\mathrm{E}(\gamma|\phi, y)$ under the normal or $t$ approximations).

For example, suppose we have approximated the $d$-dimensional conditional density function,
$p(\gamma|\phi, y)$, by a multivariate normal density with mean $\hat\gamma$ and scale matrix $V_\gamma$, both of
which depend on $\phi$. We can then approximate the marginal density of $\phi$ by

$$p_{\text{approx}}(\phi|y) \propto p(\hat\gamma(\phi), \phi|y)\, |V_\gamma(\phi)|^{1/2}, \tag{13.10}$$

where $\phi$ is included in parentheses to indicate that the mean and scale matrix must be
evaluated at each value of $\phi$. The same result holds if a $t$ approximation is used; in either
case, the normalizing factor in the denominator of (13.9) is proportional to $|V_\gamma(\phi)|^{-1/2}$.
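
The ratio in (13.9) is easy to evaluate on a grid. The sketch below is an illustration under our own assumptions, not an example from the text: it uses hypothetical normal data with a uniform prior on $(\mu, \log\sigma)$, takes $\gamma = \mu$ and $\phi = \log\sigma$, and uses the fact that for this model the conditional distribution of $\mu$ given $\sigma$ is exactly $\mathrm{N}(\bar y, \sigma^2/n)$. Because the conditional density is exact here, the ratio should not depend on the value of $\gamma$ plugged in, which the last lines check.

```python
import numpy as np

# Hypothetical data; gamma = mu and phi = log sigma in the general notation.
y = np.array([10.2, 9.7, 10.5, 9.9, 10.1, 10.4, 9.6, 10.0])
n, ybar = y.size, y.mean()

def log_joint(gamma, phi):
    """Unnormalized log p(gamma, phi|y) under a uniform prior on (mu, log sigma)."""
    return -n * phi - 0.5 * np.sum((y - gamma) ** 2) * np.exp(-2 * phi)

def log_conditional(gamma, phi):
    """Normalized log density of N(gamma | ybar, sigma^2 / n), with sigma = exp(phi)."""
    var = np.exp(2 * phi) / n
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (gamma - ybar) ** 2 / var

def log_marginal_approx(phi, gamma=None):
    """Log of the ratio (13.9); gamma defaults to the center of the conditional."""
    if gamma is None:
        gamma = ybar
    return log_joint(gamma, phi) - log_conditional(gamma, phi)

# Evaluate on a grid of phi and normalize numerically.
phi_grid = np.linspace(-2.5, 0.5, 61)
log_marg = np.array([log_marginal_approx(p) for p in phi_grid])
density = np.exp(log_marg - log_marg.max())
density /= density.sum() * (phi_grid[1] - phi_grid[0])

# With an exact conditional density, the choice of gamma does not matter.
print(log_marginal_approx(-1.0), log_marginal_approx(-1.0, gamma=ybar + 0.3))
print("grid mode of log sigma:", phi_grid[np.argmax(log_marg)])
```

The denominator must be a normalized density in $\gamma$, including the factor that depends on $\phi$; dropping that factor is the most common mistake when applying (13.9).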


Improving an approximation using importance resampling. We can improve the approximation
with importance sampling, using draws of $\gamma$ from each value of $\phi$ at which the
approximation is computed. For any given value of $\phi$, we can write the marginal posterior
density as

$$\begin{aligned}
p(\phi|y) &= \int p(\gamma, \phi|y)\, d\gamma \\
&= \int \frac{p(\gamma, \phi|y)}{p_{\text{approx}}(\gamma|\phi, y)}\, p_{\text{approx}}(\gamma|\phi, y)\, d\gamma \\
&= \mathrm{E}_{\text{approx}}\!\left( \frac{p(\gamma, \phi|y)}{p_{\text{approx}}(\gamma|\phi, y)} \right),
\end{aligned} \tag{13.11}$$

where $\mathrm{E}_{\text{approx}}$ averages over $\gamma$ using the conditional posterior distribution, $p_{\text{approx}}(\gamma|\phi, y)$.
The importance sampling estimate of $p(\phi|y)$ can be computed by simulating $S$ values $\gamma^s$
from the approximate conditional distribution, computing the joint density and approximate
conditional density at each $\gamma^s$, and then averaging the $S$ values of $p(\gamma^s, \phi|y)/p_{\text{approx}}(\gamma^s|\phi, y)$.
This procedure is then repeated for each point on the grid of $\phi$. If the normalizing constant
for the joint density $p(\gamma, \phi|y)$ is itself unknown, then more complicated computational
procedures must be used.
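
The estimate in (13.11) can be sketched with a toy joint density for which the answer is known. Everything in this example is an assumption made for illustration: the joint density has $\phi \sim \mathrm{N}(0, 1)$ and $\gamma|\phi \sim \mathrm{N}(\phi, 1)$, so that the true marginal density of $\phi$ is standard normal, and the approximate conditional density is deliberately taken to be a $t_4$ centered at $\phi$, so that the importance ratios are bounded.

```python
import math
import numpy as np

def log_normal_pdf(x, mean, sd):
    return -0.5 * np.log(2 * np.pi) - np.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def log_t_pdf(x, df, loc, scale):
    """Log density of a location-scale t distribution with df degrees of freedom."""
    z = (x - loc) / scale
    const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi) - math.log(scale))
    return const - 0.5 * (df + 1) * np.log1p(z ** 2 / df)

def log_joint(gamma, phi):
    """Toy normalized joint density: phi ~ N(0, 1), gamma | phi ~ N(phi, 1)."""
    return log_normal_pdf(phi, 0.0, 1.0) + log_normal_pdf(gamma, phi, 1.0)

def marginal_by_importance_sampling(phi, n_draws, rng, df=4):
    """Average of p(gamma^s, phi|y) / p_approx(gamma^s|phi, y) over draws gamma^s."""
    gamma = phi + rng.standard_t(df, size=n_draws)  # draws from the t approximation
    log_ratio = log_joint(gamma, phi) - log_t_pdf(gamma, df, phi, 1.0)
    return np.mean(np.exp(log_ratio))

rng = np.random.default_rng(1)
phi_values = [-1.0, 0.0, 1.5]
estimates = [marginal_by_importance_sampling(p, 20000, rng) for p in phi_values]
exact = [math.exp(log_normal_pdf(p, 0.0, 1.0)) for p in phi_values]
for p, est, ex in zip(phi_values, estimates, exact):
    print(p, est, ex)
```

Using an approximating density with heavier tails than the target keeps the ratios bounded; the reverse choice (a normal approximation to a long-tailed conditional density) can give importance ratios with very large or infinite variance.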


13.6 Example: hierarchical normal model (continued) 


We illustrate mode-based computations with the hierarchical normal model that we used 
in Section 11.6. In that section, we illustrated the Gibbs sampler and the Metropolis 
algorithm as two different ways of drawing posterior samples. In this section, we describe 
how to get approximate inference by finding the mode of $p(\mu, \log\sigma, \log\tau|y)$, the marginal
posterior density, and a normal approximation centered at the mode. Given $(\mu, \log\sigma, \log\tau)$,
the individual means $\theta_j$ have independent normal conditional posterior distributions.

                                              Stepwise ascent
                               Crude      First      Second     Third
Parameter                      estimate   iteration  iteration  iteration
θ_1                            61.00      61.28      61.29      61.29
θ_2                            66.00      65.87      65.87      65.87
θ_3                            68.00      67.74      67.73      67.73
θ_4                            61.00      61.15      61.15      61.15
μ                              64.00      64.01      64.01      64.01
σ                               2.29       2.17       2.17       2.17
τ                               3.56       3.32       3.31       3.31
log p(θ, μ, log σ, log τ|y)   −63.70     −61.42     −61.42     −61.42


Table 13.1 Convergence of stepwise ascent to a joint posterior mode for the coagulation example.
The joint posterior density increases at each conditional maximization step, as it should. The
posterior mode is in terms of log σ and log τ, but these values are transformed back to the original
scale for display in the table.


Crude initial parameter estimates 


Initial parameter estimates for the computations are easily obtained by estimating $\theta_j$ as
$\bar y_{.j}$, the average of the observations in the $j$th group, for each $j$, and estimating $\sigma^2$ as the
average of the $J$ within-group sample variances, $s_j^2 = \sum_{i=1}^{n_j}(y_{ij} - \bar y_{.j})^2/(n_j - 1)$. We then
crudely estimate $\mu$ and $\tau^2$ as the mean and variance of the $J$ estimated values of $\theta_j$. For
the coagulation data in Table 11.2, the crude estimates are shown in the first column of
Table 13.1.


Conditional maximization to find the joint mode of $p(\theta, \mu, \log\sigma, \log\tau|y)$


Because of the conjugacy of the normal model, it is easy to perform conditional maximization
on the joint posterior density, updating each parameter in turn by its conditional mode.
In general, we analyze scale parameters such as $\sigma$ and $\tau$ on the logarithmic scale. The
conditional modes for each parameter are easy to compute, especially because we have already
determined the conditional posterior density functions in computing the Gibbs sampler for
this problem in Section 11.6. After obtaining a starting guess for the parameters, the
conditional maximization proceeds as follows, where the parameters can be updated in any
order.


1. Conditional mode of each $\theta_j$. The conditional posterior distribution of $\theta_j$, given all other
parameters in the model, is normal and given by (11.10). For $j = 1, \ldots, J$, we can
maximize the conditional posterior density of $\theta_j$ given $(\mu, \sigma, \tau, y)$ (and thereby increase
the joint posterior density), by replacing the current estimate of $\theta_j$ by $\hat\theta_j$ in (11.10).

2. Conditional mode of $\mu$. The conditional posterior distribution of $\mu$, given all other
parameters in the model, is normal and given by (11.12). For conditional maximization,
replace the current estimate of $\mu$ by $\hat\mu$ in (11.13).

3. Conditional mode of $\log\sigma$. The conditional posterior density for $\sigma^2$ is scaled inverse-$\chi^2$
and given by (11.14). The mode of the conditional posterior density of $\log\sigma$ is obtained
by replacing the current estimate of $\log\sigma$ with $\log\hat\sigma$, with $\hat\sigma$ defined in (11.15). (From
Appendix A, the conditional mode of $\sigma^2$ is $\frac{n}{n+2}\hat\sigma^2$. The factor of $\frac{n}{n+2}$ does not appear
in the conditional mode of $\log\sigma$ because of the Jacobian factor when transforming from
$p(\sigma^2)$ to $p(\log\sigma)$; see Exercise 13.7.)

4. Conditional mode of $\log\tau$. The conditional posterior density for $\tau^2$ is scaled inverse-$\chi^2$
and given by (11.16). The mode of the conditional posterior density of $\log\tau$ is obtained
by replacing the current estimate of $\log\tau$ with $\log\hat\tau$, with $\hat\tau$ defined in (11.17).


Numerical results of conditional maximization for the coagulation example are presented
in Table 13.1, from which we see that the algorithm has required only three iterations to
reach approximate convergence in this small example. We also see that this posterior
mode is extremely close to the crude estimate, which occurs because the shrinkage factors
$\frac{\sigma^2}{n_j} \big/ \left(\frac{\sigma^2}{n_j} + \tau^2\right)$ are all near zero. Incidentally, the mode displayed in Table 13.1 is only a local
mode; the joint posterior density also has another mode at the boundary of the parameter
space; we are not especially concerned with that degenerate mode because the region around
it includes little of the posterior mass (see Exercise 13.8).

In a simple applied analysis, we might stop here with an approximate posterior distribu- 
tion centered at this joint mode, or even just stay with the simpler crude estimates. In other 
hierarchical examples, however, there might be quite a bit of pooling, as in the educational 
testing problem of Section 5.5, in which case it is advisable to continue the analysis, as we 
describe below. 


Factoring into conditional and marginal posterior densities 


As discussed, the joint mode often does not provide a useful summary of the posterior
distribution, especially when $J$ is large relative to the $n_j$'s. To investigate this possibility,
we consider the marginal posterior distribution of a subset of the parameters. In this
example, using the notation of the previous sections, we set $\gamma = (\theta_1,\ldots,\theta_J) = \theta$ and
$\phi = (\mu, \log\sigma, \log\tau)$, and we consider the posterior distribution as the product of the marginal
posterior distribution of $\phi$ and the conditional posterior distribution of $\theta$ given $\phi$. The
subvector $(\mu, \log\sigma, \log\tau)$ has only three components no matter how large $J$ is, so we expect
asymptotic approximations to work better for the marginal distribution of $\phi$ than for the
joint distribution of $(\gamma, \phi)$.

From (11.9) in the Gibbs sampling analysis of the coagulation data in Chapter 11, the
conditional posterior density of the normal means, $p(\theta|\mu,\sigma,\tau,y)$, is a product of independent
normal densities with means $\hat{\theta}_j$ and variances $V_{\theta_j}$ that are easily computable functions of
$(\mu,\sigma,\tau,y)$.

The marginal posterior density, $p(\mu, \log\sigma, \log\tau|y)$, of the remaining parameters, can be
determined using formula (13.9), where the conditional distribution $p_{\rm approx}(\theta|\mu, \log\sigma, \log\tau, y)$
is actually exact. Thus,

$$
p(\mu, \log\sigma, \log\tau|y) = \frac{p(\theta, \mu, \log\sigma, \log\tau|y)}{p(\theta|\mu, \log\sigma, \log\tau, y)}
\propto \frac{\tau \prod_{j=1}^{J} N(\theta_j|\mu,\tau^2) \prod_{j=1}^{J}\prod_{i=1}^{n_j} N(y_{ij}|\theta_j,\sigma^2)}{\prod_{j=1}^{J} N(\theta_j|\hat{\theta}_j, V_{\theta_j})}.
$$

Because the denominator is exact, this identity must hold for any $\theta$; to simplify calculations,
we set $\theta = \hat{\theta}$, to yield

$$
p(\mu, \log\sigma, \log\tau|y) \propto \tau \prod_{j=1}^{J} N(\hat{\theta}_j|\mu,\tau^2) \prod_{j=1}^{J}\prod_{i=1}^{n_j} N(y_{ij}|\hat{\theta}_j,\sigma^2) \prod_{j=1}^{J} V_{\theta_j}^{1/2}, \tag{13.12}
$$

with the final factor coming from the normalizing constant of the normal distribution in
the denominator, and where $\hat{\theta}_j$ and $V_{\theta_j}$ are defined by (11.11).


Finding the marginal posterior mode of $p(\mu, \log\sigma, \log\tau|y)$ using EM


The marginal posterior mode of $(\mu,\sigma,\tau)$, the maximum of (13.12), cannot be found
analytically because the $\hat{\theta}_j$'s and $V_{\theta_j}$'s are functions of $(\mu,\sigma,\tau)$. One possible approach is
Newton's method, which requires computing derivatives and second derivatives analytically
or numerically. For this problem, however, it is easier to use the EM algorithm.
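
Although (13.12) has no analytic maximum, it is cheap to evaluate numerically. The following Python sketch computes its logarithm and also checks the identity behind it: the ratio of the joint density to the exact conditional density gives the same value for any $\theta$. The coagulation measurements are quoted from memory and should be treated as an illustrative assumption, and the helper names (`norm_logpdf`, `theta_hat_and_V`, `log_marginal_ratio`, `log_marginal`) are invented for this example.

```python
import numpy as np

# Coagulation times by diet, quoted from memory; an illustrative assumption.
groups = [
    np.array([62, 60, 63, 59], dtype=float),
    np.array([63, 67, 71, 64, 65, 66], dtype=float),
    np.array([68, 66, 71, 67, 68, 68], dtype=float),
    np.array([56, 62, 60, 61, 63, 64, 63, 59], dtype=float),
]
J = len(groups)
n_j = np.array([len(g) for g in groups])
ybar = np.array([g.mean() for g in groups])


def norm_logpdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)


def theta_hat_and_V(mu, sigma, tau):
    """Conditional posterior means and variances of the theta_j, as in (11.11)."""
    V = 1 / (1 / tau**2 + n_j / sigma**2)
    theta_hat = V * (mu / tau**2 + n_j * ybar / sigma**2)
    return theta_hat, V


def log_joint(theta, mu, sigma, tau):
    """Log of tau * prod_j N(theta_j|mu,tau^2) * prod_j prod_i N(y_ij|theta_j,sigma^2)."""
    out = np.log(tau) + norm_logpdf(theta, mu, tau**2).sum()
    for g, t in zip(groups, theta):
        out += norm_logpdf(g, t, sigma**2).sum()
    return out


def log_marginal_ratio(theta, mu, sigma, tau):
    """Log of joint / conditional; should not depend on the theta supplied."""
    theta_hat, V = theta_hat_and_V(mu, sigma, tau)
    return log_joint(theta, mu, sigma, tau) - norm_logpdf(theta, theta_hat, V).sum()


def log_marginal(mu, sigma, tau):
    """Log of (13.12): the joint density at theta_hat times prod_j V_j^(1/2)."""
    theta_hat, V = theta_hat_and_V(mu, sigma, tau)
    return log_joint(theta_hat, mu, sigma, tau) + 0.5 * np.log(V).sum()


value_a = log_marginal_ratio(np.array([60.0, 65.0, 70.0, 61.0]), 64.0, 2.2, 3.3)
value_b = log_marginal_ratio(np.zeros(J), 64.0, 2.2, 3.3)
value_c = log_marginal(64.0, 2.2, 3.3)
```

Here `value_a` and `value_b` should agree up to rounding, and `value_c` should differ from them only by the constant $(J/2)\log(2\pi)$ that (13.12) drops.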

To obtain the mode of $p(\mu, \log\sigma, \log\tau|y)$ using EM, we average over the parameter $\theta$
in the E-step and maximize over $(\mu, \log\sigma, \log\tau)$ in the M-step. The logarithm of the joint
posterior density of all the parameters is

$$
\log p(\theta, \mu, \log\sigma, \log\tau|y) = -n\log\sigma - (J-1)\log\tau - \frac{1}{2\tau^2}\sum_{j=1}^{J}(\theta_j - \mu)^2
- \frac{1}{2\sigma^2}\sum_{j=1}^{J}\sum_{i=1}^{n_j}(y_{ij} - \theta_j)^2 + \text{constant}. \tag{13.13}
$$


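E-step. The E-step, averaging over $\theta$ in (13.13), requires determining the conditional
posterior expectations $\mathrm{E}_{\rm old}((\theta_j - \mu)^2)$ and $\mathrm{E}_{\rm old}((y_{ij} - \theta_j)^2)$ for all $j$. These are both easy to
compute using the conditional posterior distribution $p(\theta|\mu, \sigma, \tau, y)$, which we have already
determined in (11.9).

$$
\begin{aligned}
\mathrm{E}_{\rm old}((\theta_j - \mu)^2) &= \mathrm{E}((\theta_j - \mu)^2 \,|\, \mu^{\rm old}, \sigma^{\rm old}, \tau^{\rm old}, y) \\
&= (\mathrm{E}_{\rm old}(\theta_j - \mu))^2 + \mathrm{var}_{\rm old}(\theta_j) \\
&= (\hat{\theta}_j - \mu)^2 + V_{\theta_j}.
\end{aligned}
$$

Using a similar calculation,

$$
\mathrm{E}_{\rm old}((y_{ij} - \theta_j)^2) = (y_{ij} - \hat{\theta}_j)^2 + V_{\theta_j}.
$$

For both expressions, $\hat{\theta}_j$ and $V_{\theta_j}$ are computed from (11.11) based on $(\mu, \log\sigma, \log\tau)^{\rm old}$.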


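M-step. It is now straightforward to maximize $\mathrm{E}_{\rm old}(\log p(\theta, \mu, \log\sigma, \log\tau|y))$ as a function
of $(\mu, \log\sigma, \log\tau)$. The maximizing values are $(\mu^{\rm new}, \log\sigma^{\rm new}, \log\tau^{\rm new})$, with $(\mu,\sigma,\tau)^{\rm new}$
obtained by maximizing the expected value of (13.13):

$$
\begin{aligned}
\mu^{\rm new} &= \frac{1}{J}\sum_{j=1}^{J}\hat{\theta}_j \\
\sigma^{\rm new} &= \left(\frac{1}{n}\sum_{j=1}^{J}\sum_{i=1}^{n_j}\left((y_{ij} - \hat{\theta}_j)^2 + V_{\theta_j}\right)\right)^{1/2} \\
\tau^{\rm new} &= \left(\frac{1}{J-1}\sum_{j=1}^{J}\left((\hat{\theta}_j - \mu^{\rm new})^2 + V_{\theta_j}\right)\right)^{1/2}.
\end{aligned}
\tag{13.14}
$$

The derivation of these is straightforward (see Exercise 13.9).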
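
As a rough illustration, the E-step and M-step can be coded as follows. This is a sketch under assumptions, not the authors' program: the coagulation data are quoted from memory, the starting values are the rounded joint-mode estimates $(\mu,\sigma,\tau) = (64.01, 2.17, 3.31)$, and the function names are invented here. The log marginal density keeps the $2\pi$ constants of the normal densities so that its value is on the same scale as the ratio of normal densities defining (13.12).

```python
import numpy as np

# Coagulation times by diet, quoted from memory; an illustrative assumption.
groups = [
    np.array([62, 60, 63, 59], dtype=float),
    np.array([63, 67, 71, 64, 65, 66], dtype=float),
    np.array([68, 66, 71, 67, 68, 68], dtype=float),
    np.array([56, 62, 60, 61, 63, 64, 63, 59], dtype=float),
]
J = len(groups)
n_j = np.array([len(g) for g in groups])
n = int(n_j.sum())
ybar = np.array([g.mean() for g in groups])


def theta_hat_and_V(mu, sigma, tau):
    """Conditional posterior means and variances of the theta_j, as in (11.11)."""
    V = 1 / (1 / tau**2 + n_j / sigma**2)
    return V * (mu / tau**2 + n_j * ybar / sigma**2), V


def within_ss(theta):
    """Sum over j and i of (y_ij - theta_j)^2."""
    return sum(((g - t) ** 2).sum() for g, t in zip(groups, theta))


def log_marginal(mu, sigma, tau):
    """Log of (13.12), including the 2*pi constants of the normal densities."""
    theta_hat, V = theta_hat_and_V(mu, sigma, tau)
    return (-0.5 * (n + J) * np.log(2 * np.pi)
            - n * np.log(sigma) - (J - 1) * np.log(tau)
            - ((theta_hat - mu) ** 2).sum() / (2 * tau**2)
            - within_ss(theta_hat) / (2 * sigma**2)
            + 0.5 * np.log(V).sum())


def em_step(mu, sigma, tau):
    # E-step: moments of theta under p(theta | mu_old, sigma_old, tau_old, y).
    theta_hat, V = theta_hat_and_V(mu, sigma, tau)
    # M-step: the updates in (13.14).
    mu_new = theta_hat.mean()
    sigma_new = np.sqrt((within_ss(theta_hat) + (n_j * V).sum()) / n)
    tau_new = np.sqrt((((theta_hat - mu_new) ** 2) + V).sum() / (J - 1))
    return mu_new, sigma_new, tau_new


params = (64.01, 2.17, 3.31)  # start at the (rounded) joint mode
trace = [log_marginal(*params)]
first = None
for it in range(50):
    params = em_step(*params)
    if it == 0:
        first = params
    trace.append(log_marginal(*params))
mu, sigma, tau = params
```

The list `trace` holds the log marginal posterior density after each iteration, which is the quantity to monitor when checking that EM is behaving correctly.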


Checking that the marginal posterior density increases at each step. Ideally, at each
iteration of EM, we would compute the log of (13.12) using the just calculated $(\mu, \log\sigma, \log\tau)^{\rm new}$.
If the function does not increase, there is a mistake in the analytic calculations or the
programming, or possibly a roundoff error, which can be checked by altering the precision of
the calculations.

We apply EM to the coagulation example, using the values of $(\sigma,\mu,\tau)$ from the joint
mode as a starting point; numerical results appear in Table 13.2, where we see that the
EM algorithm has approximately converged after only three steps. As is typical in this sort
of problem, the variance parameters $\sigma$ and $\tau$ are larger at the marginal mode than at the joint
mode. The logarithm of the marginal posterior density, $\log p(\mu, \log\sigma, \log\tau|y)$, has been
computed to the (generally unnecessary) precision of three decimal places for the purpose
of checking that it does, indeed, monotonically increase. (If it had not, we would have
debugged the program before including the example in the book!)


                                     Value at            EM algorithm
Parameter                            joint mode   First       Second      Third
                                                  iteration   iteration   iteration
$\mu$                                 64.01       64.01       64.01       64.01
$\sigma$                               2.17        2.33        2.36        2.36
$\tau$                                 3.31        3.46        3.47        3.47
$\log p(\mu,\log\sigma,\log\tau|y)$  -61.99      -61.835     -61.832     -61.832

Table 13.2 Convergence of the EM algorithm to the marginal posterior mode of $(\mu, \log\sigma, \log\tau)$
for the coagulation example. The marginal posterior density increases at each EM iteration, as it
should. The posterior mode is in terms of $\log\sigma$ and $\log\tau$, but these values are transformed back to
the original scale for display in the table.


                        Posterior quantiles
Parameter     2.5%     25%      median   75%      97.5%
$\theta_1$    59.15    60.63    61.38    62.18    63.87
$\theta_2$    63.83    65.20    65.78    66.42    67.79
$\theta_3$    65.46    66.95    67.65    68.32    69.64
$\theta_4$    59.51    60.68    61.21    61.77    62.99
$\mu$         60.43    62.73    64.05    65.29    67.69
$\sigma$       1.75     2.12     2.37     2.64     3.21
$\tau$         1.44     2.62     3.43     4.65     8.19

Table 13.3 Summary of posterior simulations for the coagulation example, based on draws from
the normal approximation to $p(\mu, \log\sigma, \log\tau|y)$ and the exact conditional posterior distribution,
$p(\theta|\mu, \log\sigma, \log\tau, y)$. Compare to joint and marginal modes in Tables 13.1 and 13.2.


Constructing an approximation to the joint posterior distribution 


Having found the mode, we can construct a normal approximation based on the $3 \times 3$ matrix
of second derivatives of the marginal posterior density, $p(\mu, \log\sigma, \log\tau|y)$, in (13.12). To
draw simulations from the approximate joint posterior density, first draw $(\mu, \log\sigma, \log\tau)$
from the approximate normal marginal posterior density, then $\theta$ from the conditional
posterior distribution, $p(\theta|\mu, \log\sigma, \log\tau, y)$, which is already normal and so does not need to
be approximated. Table 13.3 gives posterior intervals for the model parameters from these
simulations.


Comparison to other computations 


If we determine that the approximate inferences are not adequate, the approximation that
we have developed can still serve us as a comparison point to more complicated
algorithms, and also to obtain starting points. For example, we can obtain a roughly
overdispersed approximation to the target distribution by sampling from the $t_4$ approximation for
$(\mu, \log\sigma, \log\tau)$, and then we can subsample using importance resampling (see Section 10.4)
and use these as starting points for the iterative simulations.


13.7 Variational inference


Variational Bayes is an algorithmic framework, similar to EM, for approximating a joint 
distribution. EM proceeds by alternately evaluating conditional expectations of the log 
density and using these to maximize a function of a set of hyperparameters (which in turn 
define the conditional distribution used to compute the expectation in the next step, and so 
forth), converging to a point estimate of the hyperparameters and thus an approximation 
to the posterior distribution. In variational Bayes, the iterations lead to a closed-form 
approximation that is the closest fit to the posterior distribution (in a sense defined below) 
within some specified class of functions. 


Minimization of Kullback-Leibler divergence 


In variational Bayes, a parametric approximation $g(\theta)$ is constructed iteratively using an
expectation procedure that, as we shall show, has the effect of minimizing the
Kullback-Leibler divergence from the target posterior distribution $p(\theta|y)$,

$$
\mathrm{KL}(g\|p) = -\mathrm{E}_g\left(\log\left(\frac{p(\theta|y)}{g(\theta)}\right)\right) = -\int \log\left(\frac{p(\theta|y)}{g(\theta)}\right) g(\theta)\,d\theta. \tag{13.15}
$$

The absolute minimum of this divergence is 0, which is attained when $g = p$. The
difficulty is that variational Bayes is typically applied in settings where we cannot directly
summarize $p$; that is, we cannot easily take posterior draws from $p(\theta|y)$, nor can we easily
compute expectations of interest, $\mathrm{E}_p(h(\theta))$. In variational Bayes we work with some simpler
parameterized class of distributions $g$ that are easier to handle.¹

Here we shall use the notation $\phi$ for the hyperparameters of the variational
approximation. Thus we write our approximating function $g$ as $g(\theta|\phi)$, and the algorithm proceeds
by starting with some guess of $\phi$ and then iteratively updating it in a way that is
mathematically guaranteed to decrease the Kullback-Leibler divergence (13.15) at each step. As
with EM, at some point $\phi$ no longer changes visibly. At that point we stop the
iteration and use $g(\theta|\phi)$ given the most recent update of $\phi$ as our approximation to the
posterior density. It can make sense to check the results by running the algorithm several
times from different starting points.
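
To make definition (13.15) concrete, here is a small Python sketch for a toy case in which both $g$ and $p$ are univariate normal densities, so the divergence has a closed form that can be compared with a simulation estimate of $-\mathrm{E}_g(\log(p/g))$. The toy target and the function names are assumptions made for illustration; in real applications $p(\theta|y)$ is exactly the thing we cannot handle this easily.

```python
import numpy as np


def kl_normal(m_g, s_g, m_p, s_p):
    """Closed-form KL(g||p) for g = N(m_g, s_g^2) and p = N(m_p, s_p^2)."""
    return np.log(s_p / s_g) + (s_g**2 + (m_g - m_p) ** 2) / (2 * s_p**2) - 0.5


def log_normal(x, m, s):
    """Log density of N(m, s^2) evaluated at x."""
    return -0.5 * np.log(2 * np.pi) - np.log(s) - (x - m) ** 2 / (2 * s**2)


def kl_monte_carlo(m_g, s_g, m_p, s_p, n_draws=200_000, seed=0):
    """Estimate KL(g||p) = -E_g[log(p/g)] by averaging over draws from g."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(m_g, s_g, size=n_draws)
    return -np.mean(log_normal(theta, m_p, s_p) - log_normal(theta, m_g, s_g))


exact = kl_normal(1.0, 0.5, 0.0, 1.0)
estimate = kl_monte_carlo(1.0, 0.5, 0.0, 1.0)
zero = kl_normal(0.0, 1.0, 0.0, 1.0)  # g equal to p gives divergence 0
```

Note that the expectation is taken under $g$, not $p$, which is why the divergence can be estimated or computed even when draws from $p$ are unavailable.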


The class of approximate distributions 


There are various ways of defining the class of distributions for the variational
approximation, $g(\theta|\phi)$. A standard approach is to constrain the components of $\theta$ to be independent;
thus, $g(\theta|\phi) = \prod_{j=1}^{J} g_j(\theta_j|\phi_j)$ for a $J$-dimensional parameter $\theta$. In that case, the family of
distributions $g_j$ over which to optimize can be determined from the mathematical form of
the posterior density function, $p(\theta|y)$.

It works like this: for each $j$, we examine the expectation of the log posterior density,
$\log p(\theta|y)$, considering it as a function of $\theta_j$, averaging over the distributions $g_{-j}$ that
represent the other $J-1$ dimensions of $\theta$. This is similar to Gibbs sampling except that
we are interested in the average rather than the conditional density. At this point in
setting up the variational Bayes algorithm, we do not yet need to evaluate the expectation
$\mathrm{E}_{g_{-j}}(\log p(\theta|y))$; we merely need to figure out its mathematical form as a function of $\theta_j$.
Once we have done this for each parameter $\theta_j$, we have determined the functional forms of
the approximating distributions, $g_j(\theta_j|\phi_j)$.


¹ In the literature on this algorithm, the approximating function is typically called $q$. We use $g$ for
consistency with our general notation in which $g$ is an approximating density, while we reserve $q$ for
unnormalized densities, as in Chapter 10.

As with EM, variational Bayes works best on exponential family models with conditionally
conjugate prior distributions, in which case the approximating distributional families
can typically be determined by inspection and the necessary expectations can be calculated
in closed form. In nonconjugate models, variational Bayes can still be done by working
within restricted functional forms such as normal distributions.


The variational Bayes algorithm 


Once the classes of approximating distributions $g_j(\theta_j|\phi_j)$ have been identified, the
computation begins with guesses of all the hyperparameters $\phi$. We then cycle through the
distributions $g_j$, in each of these steps updating the vector of hyperparameters $\phi_j$ so that
$\log g_j(\theta_j|\phi_j)$ is set to $\mathrm{E}_{g_{-j}}(\log p(\theta|y)) = \int \log p(\theta|y)\, g_{-j}(\theta_{-j}|\phi_{-j})\, d\theta_{-j}$. We use the
notation $\mathrm{E}_{g_{-j}}$ to indicate an average over the approximating distribution of all the parameters
other than $\theta_j$, conditional on the current iteration of $g_{-j}$. The result is a $(J-1)$-dimensional
integral, but for many models we are able to evaluate these expectations analytically.

We shall sketch the proof that the steps of variational Bayes decrease $\mathrm{KL}(g\|p)$ and
thus gradually bring the approximating distribution $g(\theta)$ closer to the target posterior
distribution $p(\theta|y)$. But first we show how the algorithm works in a simple but nontrivial
example.


Example. Educational testing experiments 

We illustrate with the hierarchical model for the 8 schools from Section 5.5. As with
our demonstration of HMC in Section 12.5, we label the eight school effects (defined
as $\theta_j$ in Chapter 5) as $\alpha_j$; the full vector of parameters $\theta$ then has 10 dimensions,
corresponding to $\alpha_1,\ldots,\alpha_8, \mu, \tau$, and the log posterior density is

$$
\log p(\theta|y) = -\frac{1}{2}\sum_{j=1}^{8}\frac{(y_j - \alpha_j)^2}{\sigma_j^2} - 8\log\tau - \frac{1}{2}\frac{1}{\tau^2}\sum_{j=1}^{8}(\alpha_j - \mu)^2 + \text{const}. \tag{13.16}
$$

We shall follow standard practice with variational Bayes and approximate $p(\theta)$ by a
product of independent densities; thus,

$$
g(\theta) = g(\alpha_1,\ldots,\alpha_8, \mu, \tau) = g(\alpha_1)\cdots g(\alpha_8)\, g(\mu)\, g(\tau). \tag{13.17}
$$


Determining the form of the approximating distributions. We begin variational Bayes 
as follows. For each of the ten variables in the model, we inspect the log posterior 
density, consider its expectation averaging over the independent distribution g for the 
other nine variables, and determine its parametric form: 


• For each $\alpha_j$, we look at $\mathrm{E}\log p$, averaging over the other seven $\alpha$'s, $\mu$, and $\tau$; that
is, averaging (13.16) over all the factors of (13.17) except for $g(\alpha_j)$. The result
is a quadratic function of $\alpha_j$. For this inspection, the details of the averaging
distributions $g$ are irrelevant; it is enough to know that they are independent. All
we need to do is look at (13.16) and consider what will happen if we average over
all parameters other than $\alpha_j$. The result is,

$$
-\frac{1}{2}\frac{(y_j - \alpha_j)^2}{\sigma_j^2} - \frac{1}{2}\mathrm{E}\left(\frac{1}{\tau^2}\right)\mathrm{E}((\alpha_j - \mu)^2) + \text{constant},
$$

where various expectations that do not involve $\alpha_j$ can be swept into the constant
term. To be more explicit, we could write the expectation above as $\mathrm{E}_{g_{-\alpha_j}}\log p$,
indicating that it averages over all the factors of $g$ in (13.17) except for $g_{\alpha_j}$.

In any case, we recognize $\mathrm{E}\log p$ as a quadratic function of $\alpha_j$ and thus $e^{\mathrm{E}\log p(\theta|y)}$
is proportional to a normal density when considered as a function of $\alpha_j$. We can
identify the parameters of this normal distribution by completing the square of the
quadratic expression or, more intuitively from a statistical perspective, recognizing
the expression as equivalent to two pieces of information, one centered at $y_j$
with inverse-variance $\frac{1}{\sigma_j^2}$ and one centered at $\mathrm{E}(\mu)$ with inverse-variance $\mathrm{E}\left(\frac{1}{\tau^2}\right)$.
We combine these by weighting the means and adding the inverse-variances, thus
getting the following form for the variational Bayes component for $\alpha_j$:

$$
g(\alpha_j) = N\left(\alpha_j \,\Bigg|\, \frac{\frac{1}{\sigma_j^2}y_j + \mathrm{E}\left(\frac{1}{\tau^2}\right)\mathrm{E}(\mu)}{\frac{1}{\sigma_j^2} + \mathrm{E}\left(\frac{1}{\tau^2}\right)},\ \frac{1}{\frac{1}{\sigma_j^2} + \mathrm{E}\left(\frac{1}{\tau^2}\right)}\right). \tag{13.18}
$$


• For $\mu$, we inspect (13.16). Averaging over all the parameters other than $\mu$, the
expression $\mathrm{E}\log p(\theta|y)$ has the form $-\frac{1}{2}\mathrm{E}\left(\frac{1}{\tau^2}\right)\sum_{j=1}^{8}(\mathrm{E}(\alpha_j) - \mu)^2 + \text{const}$. As above,
this is the logarithm of a normal density function; the parameters of this distribution
can be determined by considering it as a combination of 8 pieces of information:

$$
g(\mu) = N\left(\mu \,\Bigg|\, \frac{1}{8}\sum_{j=1}^{8}\mathrm{E}(\alpha_j),\ \frac{1}{8}\,\frac{1}{\mathrm{E}\left(\frac{1}{\tau^2}\right)}\right). \tag{13.19}
$$

• Finally, averaging over all parameters other than $\tau$ gives a density function that
can be recognized as inverse-gamma or, in the parameterization we prefer,

$$
g(\tau^2) = \text{Inv-}\chi^2\left(\tau^2 \,\Bigg|\, 7,\ \frac{1}{7}\sum_{j=1}^{8}\mathrm{E}((\alpha_j - \mu)^2)\right), \tag{13.20}
$$

with the expectation $\mathrm{E}((\alpha_j - \mu)^2)$ over the approximating distribution $g$.

The above expressions are essentially identical to the derivations of the conditional
distributions for the Gibbs sampler for the hierarchical normal model in Section 11.6
and the EM algorithm in Section 13.6, with the only difference being that in the
8-schools example we assume the data variances $\sigma_j^2$ are known.


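Determining the conditional expectations. Rewriting the above factors in generic
notation, we have:

$$
\begin{aligned}
g(\alpha_j) &= N(\alpha_j|M_{\alpha_j}, S_{\alpha_j}^2), \quad \text{for } j = 1,\ldots,8 && (13.21) \\
g(\mu) &= N(\mu|M_\mu, S_\mu^2) && (13.22) \\
g(\tau^2) &= \text{Inv-}\chi^2(\tau^2|7, M_\tau^2). && (13.23)
\end{aligned}
$$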


We will need these to get the conditional expectations for each of the above three
steps:

• To specify the distribution for $\alpha_j$ in (13.18), we need $\mathrm{E}(\mu)$, which is $M_\mu$ from
(13.22), and $\mathrm{E}\left(\frac{1}{\tau^2}\right)$, which is $\frac{1}{M_\tau^2}$ from (13.23).

• To specify the distribution for $\mu$ in (13.19), we need $\mathrm{E}(\alpha_j)$, which is $M_{\alpha_j}$ from
(13.21), and $\mathrm{E}\left(\frac{1}{\tau^2}\right)$, which is $\frac{1}{M_\tau^2}$ from (13.23).

• To specify the distribution for $\tau$ in (13.20), we need $\mathrm{E}((\alpha_j - \mu)^2)$, which is
$(M_{\alpha_j} - M_\mu)^2 + S_{\alpha_j}^2 + S_\mu^2$ from (13.21) and (13.22), and using the assumption that the
densities $g$ are independent in the variational approximation.

Figure 13.5 Progress of variational Bayes for the parameters governing the variational
approximation for the hierarchical model for the 8 schools. After a random starting point, the parameters
require about 50 iterations to reach approximate convergence. The lower-right graph shows the
Kullback-Leibler divergence $\mathrm{KL}(g\|p)$ (calculated up to an arbitrary additive constant); $\mathrm{KL}(g\|p)$ is
guaranteed to uniformly decrease if the variational algorithm is programmed correctly.


Starting values. To get variational Bayes started, we need to initialize, not the
variables $\alpha, \mu, \tau$, but the parameters in their distributions $g(\alpha), g(\mu), g(\tau)$. For
simplicity we draw the unbounded parameters $M_{\alpha_1},\ldots,M_{\alpha_8}, M_\mu$ from independent $N(0, 1)$
distributions and draw the bounded parameters $S_{\alpha_1},\ldots,S_{\alpha_8}, S_\mu$ from independent
$U(0, 1)$ distributions.

Running the algorithm. We iterate through $\alpha_1,\ldots,\alpha_8, \mu, \tau$, at each iteration
updating the distributions (13.18)-(13.20) using the expectations from the current values
of the other distributions. That is, we compute the parameters in (13.18)-(13.20) by
plugging in the expectations described in the bullet points above, using the current
values of $M$'s and $S$'s. Then we turn around and label the newly computed means
and standard deviations in (13.18)-(13.20) as the updated $M$'s and $S$'s. The algorithm
thus looks a lot like EM, with the difference that it is distributions, rather than point
estimates, that are being updated. The first five plots in Figure 13.5 show the progress
of the parameters of the distributions (13.21)-(13.23). With these particular starting
points, the algorithm takes a while to get moving, but once it gets unstuck, it quickly
finds a solution.


Checking that the fit is improving. As noted above, the Kullback-Leibler divergence
(13.15) should decrease in each step of variational Bayes. In this example we can
evaluate the expression analytically and so we do. Because we are comparing these
values and do not care about their absolute level, we can simplify our analysis by
ignoring constants that do not depend on the parameters $\theta$.
For the example at hand, the Kullback-Leibler divergence is,

$$
\begin{aligned}
\mathrm{KL}(g\|p) &= -\mathrm{E}_g\left(\log\left(\frac{p(\theta|y)}{g(\theta)}\right)\right) = -\mathrm{E}_g(\log p(\theta|y)) + \mathrm{E}_g(\log g(\theta)) \\
&= \frac{1}{2}\sum_{j=1}^{8}\frac{(y_j - M_{\alpha_j})^2 + S_{\alpha_j}^2}{\sigma_j^2} + 8\log M_\tau + \frac{1}{2}\sum_{j=1}^{8}\frac{(M_{\alpha_j} - M_\mu)^2 + S_{\alpha_j}^2 + S_\mu^2}{M_\tau^2} \\
&\quad - \sum_{j=1}^{8}\log S_{\alpha_j} - \log S_\mu - \log M_\tau + \text{constant}.
\end{aligned}
$$

The lower-right plot in Figure 13.5 shows the steady decrease in $\mathrm{KL}(g\|p)$ as the
algorithm progresses.

Figure 13.6 Progress of inferences for the effects in schools A, B, and C, for 100 iterations of
variational Bayes. The lines and shaded regions show the median, 50% interval, and 90% interval
for the variational distribution. Shown to the right of each graph are the corresponding quantiles
for the full Bayes inference as computed via simulation.
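
The whole procedure for the 8-schools example fits in a short Python sketch. Several things here are assumptions rather than part of the text: the estimates `y` and standard errors `sigma` are quoted from memory of the Section 5.5 data, the starting value for $M_\tau$ is drawn from $U(0,1)$ as a convenience, and the names `kl_up_to_const` and `run_vb` are invented. The divergence is computed up to an additive constant, with the $\tau$ entropy term taken for $g$ viewed as a density on $\tau$ so that it is consistent with update (13.20).

```python
import numpy as np

# 8-schools estimates and standard errors, quoted from memory of Section 5.5;
# treat these numbers as an illustrative assumption.
y = np.array([28.0, 8.0, -3.0, 7.0, -1.0, 1.0, 18.0, 12.0])
sigma = np.array([15.0, 10.0, 16.0, 11.0, 9.0, 11.0, 10.0, 18.0])
J = len(y)


def kl_up_to_const(M_alpha, S_alpha, M_mu, S_mu, M_tau):
    """KL(g||p) for the factorized approximation, dropping additive constants."""
    return (0.5 * np.sum(((y - M_alpha) ** 2 + S_alpha**2) / sigma**2)
            + 8 * np.log(M_tau)
            + 0.5 * np.sum((M_alpha - M_mu) ** 2 + S_alpha**2 + S_mu**2) / M_tau**2
            - np.sum(np.log(S_alpha)) - np.log(S_mu) - np.log(M_tau))


def run_vb(n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random start: unbounded parameters from N(0,1), bounded ones from U(0,1).
    M_alpha, M_mu = rng.normal(size=J), rng.normal()
    S_alpha, S_mu = rng.uniform(size=J), rng.uniform()
    M_tau = rng.uniform()  # assumption: M_tau is initialized like the S's
    kl = [kl_up_to_const(M_alpha, S_alpha, M_mu, S_mu, M_tau)]
    for _ in range(n_iter):
        # Update g(alpha_j), (13.18), using E(mu) = M_mu and E(1/tau^2) = 1/M_tau^2.
        prec = 1 / sigma**2 + 1 / M_tau**2
        M_alpha = (y / sigma**2 + M_mu / M_tau**2) / prec
        S_alpha = np.sqrt(1 / prec)
        # Update g(mu), (13.19), using E(alpha_j) = M_alpha_j.
        M_mu = M_alpha.mean()
        S_mu = np.sqrt(M_tau**2 / J)
        # Update g(tau^2), (13.20), using E((alpha_j - mu)^2).
        expected_sq = (M_alpha - M_mu) ** 2 + S_alpha**2 + S_mu**2
        M_tau = np.sqrt(expected_sq.sum() / (J - 1))
        kl.append(kl_up_to_const(M_alpha, S_alpha, M_mu, S_mu, M_tau))
    return {"M_alpha": M_alpha, "S_alpha": S_alpha, "M_mu": M_mu,
            "S_mu": S_mu, "M_tau": M_tau, "kl": kl}


result = run_vb()
```

The eight $\alpha_j$ factors are updated in one vectorized step; this is equivalent to cycling through them one at a time because, given $g(\mu)$ and $g(\tau)$, the update for one $\alpha_j$ does not involve the others.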


Comparing variational and full Bayes solutions. Figure 13.6 shows the progress of
the variational algorithm for three of the parameters, $\alpha_1, \alpha_2, \alpha_3$, corresponding to the
effect of the coaching programs of the first three schools in the dataset. The functional
form here is Gaussian so it will necessarily fail at capturing some of the subtleties of
the posterior distribution, as can be seen by the comparison to the asymmetric full
Bayes intervals in this example. In addition, this variational fit does not allow for
dependence among the $\alpha_j$'s and thus would be inappropriate for some purposes. That
said, the approximation fits the marginal distributions fairly well, and variational
Bayes represents a fast and scalable approach for inference in this sort of problem
with large datasets.

Unlike MCMC, which eventually converges to the posterior density, the variational 
inference converges to an approximation—the closest fit within a restricted class. So 
in a case like this, where we can also run MCMC long enough for convergence, it makes 
sense to try to understand variational Bayes by comparing it to the actual posterior 
density. 


Proof that each step of variational Bayes decreases the Kullback-Leibler divergence 


The Kullback-Leibler divergence (13.15) is defined using the (normalized) posterior density,
$p(\theta|y)$. The first step in understanding variational Bayes is to re-express it in terms of the
unnormalized density, $p(\theta,y) = p(\theta)p(y|\theta)$, which we can calculate. The re-expression goes
like this: $p(\theta|y) = \frac{p(\theta,y)}{p(y)}$, so $\log p(\theta|y) = \log p(\theta,y) - \log p(y)$, and thus,

$$
\mathrm{KL}(g\|p) = -\int \log\left(\frac{p(\theta|y)}{g(\theta)}\right) g(\theta)\,d\theta
= -\mathrm{E}_g\left(\log\left(\frac{p(\theta,y)}{g(\theta)}\right)\right) + \log p(y). \tag{13.24}
$$

We cannot in general evaluate that last term, $\log p(y)$, but for the purpose of variational
inference all we need to realize is that it does not depend on $g$. Hence, for any given model
$p$ and data $y$, decreasing the first term on the right of (13.24) is equivalent to decreasing
$\mathrm{KL}(g\|p)$. The expectation in that term, $\mathrm{E}_g\left(\log\left(\frac{p(\theta,y)}{g(\theta)}\right)\right)$, is called the variational lower bound.


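Next we show that each step of the variational algorithm is guaranteed to increase
the variational lower bound, or equivalently to decrease the global Kullback-Leibler
divergence (13.24). Consider the step where the distribution $g_j(\theta_j)$ is being updated. We shall
decompose the variational lower bound using the factorization we have assumed for the
approximating distribution: $g(\theta) = g_j(\theta_j)\, g_{-j}(\theta_{-j})$:

$$
\begin{aligned}
-\mathrm{E}_g\left(\log\left(\frac{p(\theta,y)}{g(\theta)}\right)\right)
&= \int\!\!\int g_j(\theta_j)\, g_{-j}(\theta_{-j}) \left(-\log p(\theta,y) + \log g_j(\theta_j) + \log g_{-j}(\theta_{-j})\right) d\theta_j\, d\theta_{-j} \\
&= -\int g_j(\theta_j)\left(\int \log p(\theta,y)\, g_{-j}(\theta_{-j})\, d\theta_{-j}\right) d\theta_j \\
&\quad + \int g_j(\theta_j)\log g_j(\theta_j)\, d\theta_j + \int g_{-j}(\theta_{-j})\log g_{-j}(\theta_{-j})\, d\theta_{-j}.
\end{aligned}
\tag{13.25}
$$

We were able to turn the double integrals into single integrals in the last line above because
we have assumed that $g_j$ and $g_{-j}$ are normalized probability densities and thus integrate
to 1.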

Here we are considering only the step at which $g_j$ is being updated, so we can ignore
the last term in (13.25) as it only depends on $g_{-j}$, and we can consider the expression in
brackets in the first term to be (temporarily) a constant in that it does not depend on $g_j$:

$$
\mathrm{E}_{g_{-j}}\log p(\theta,y) = \int g_{-j}(\theta_{-j})\log p(\theta,y)\, d\theta_{-j}. \tag{13.26}
$$

As $\theta_{-j}$ has been integrated out, expression (13.26) is a function only of $\theta_j$ and $y$, and it
can be considered as the logarithm of an unnormalized probability density of $\theta_j$, which we
shall call,

$$
\log \tilde{p}(\theta_j) = \mathrm{E}_{g_{-j}}\log p(\theta,y) + \text{constant}. \tag{13.27}
$$

That is, $\log\tilde{p}$ is the (unnormalized) log density you get by considering $\log p(\theta|y)$ as a function
of $\theta_j$ and taking the expectation over all the other components of $\theta$, averaging over the current
iteration of $g_{-j}$ (as illustrated in detail for the 8-schools example earlier in this section).
Expression (13.25) then becomes,

$$
-\mathrm{E}_g\left(\log\left(\frac{p(\theta,y)}{g(\theta)}\right)\right) = -\int g_j(\theta_j)\log\left(\frac{\tilde{p}(\theta_j)}{g_j(\theta_j)}\right) d\theta_j + \text{const}. \tag{13.28}
$$

This last expression is just $\mathrm{KL}(g_j\|\tilde{p})$, the Kullback-Leibler divergence of $g_j$ with respect to
$\tilde{p}$, and it is minimized when $g_j(\theta_j) = \tilde{p}(\theta_j)$ as defined in (13.27).

Thus, when it is possible to evaluate the expectations in (13.26) and thus determine
$g_j$, we have our update, and the variational algorithm is guaranteed to decrease the global
Kullback-Leibler divergence (13.15) at each step. If the expectation (13.26) cannot be done
in closed form, what is needed is some update to $g_j$ that decreases (13.28), thus bringing
$g_j$ closer to $\tilde{p}$ in that step.
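
The argument can be checked numerically in a case small enough to enumerate. The Python sketch below is a toy assumption, not an example from the text: the target is a random discrete distribution over a $3 \times 4$ grid of values $(\theta_1, \theta_2)$, the approximation is $g = g_1 g_2$, and each factor is updated by setting $\log g_j$ equal to $\mathrm{E}_{g_{-j}}\log p$ and renormalizing, as in (13.27).

```python
import numpy as np


def kl(g, p):
    """KL(g||p) = sum of g*log(g/p) for strictly positive discrete distributions."""
    return float(np.sum(g * (np.log(g) - np.log(p))))


def normalize_exp(log_w):
    """Exponentiate and normalize, subtracting the maximum for numerical stability."""
    w = np.exp(log_w - log_w.max())
    return w / w.sum()


rng = np.random.default_rng(1)
p = rng.random((3, 4)) + 0.1   # hypothetical joint target over (theta_1, theta_2)
p /= p.sum()
g1 = np.full(3, 1 / 3)         # starting factor for theta_1
g2 = np.full(4, 1 / 4)         # starting factor for theta_2

trace = [kl(np.outer(g1, g2), p)]
for _ in range(5):
    # log p~(theta_1) = E_{g2} log p(theta_1, theta_2), then normalize.
    g1 = normalize_exp(np.log(p) @ g2)
    trace.append(kl(np.outer(g1, g2), p))
    # log p~(theta_2) = E_{g1} log p(theta_1, theta_2), then normalize.
    g2 = normalize_exp(g1 @ np.log(p))
    trace.append(kl(np.outer(g1, g2), p))
```

Each entry of `trace` is the global divergence after one factor update, so the sequence should be nonincreasing and bounded below by 0.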


Model checking 


The variational Bayes approximation is a generative model; that is, a proper probability
distribution for the parameters $\theta$. Thus we can check its fit by drawing a sample $\theta^s$, $s =
1,\ldots,S$, from the fitted $g(\theta)$ and then, for each $\theta^s$, drawing a replicated dataset $y^{{\rm rep}\,s}$. For
the 8-schools example, we would first draw 1000 replications of the parameters $\alpha_1,\ldots,\alpha_8$
from the final $g$ in our variational Bayes calculation, then sample 1000 replications for data
for the 8 schools. Or, if we were interested in predictions for new schools, we would first
draw 1000 values of $(\mu, \tau)$ from $g$, then, for each of these simulations, draw 8 new schools
from $N(\mu,\tau^2)$, and then draw one new data observation from each new school (conditional
on some assumed $\sigma_j$).

In general we would expect that, compared to full Bayes, variational inferences would
provide a better fit to observed data. As a point estimate of the distribution, variational
Bayes can overfit the data. Nonetheless, a model check can be a good idea as it could still
reveal problems with the inferences.


Variational Bayes followed by importance sampling or particle filtering 


Variational methods are commonly used as an approximate method when simulation-based 
full Bayes is too computationally expensive, as with very large models or datasets. In such 
cases it might make sense to use the variational estimate as a starting point for a stochastic 
algorithm leading to a better approximation to the target distribution. 

The simplest idea would be importance sampling: in the sorts of problems where
variational methods are tractable, we can easily compute both the target density $p(\theta|y)$ and
the approximation $g(\theta)$, and we can also get fast simulations directly from $g$. We can
then compute $S$ simulation draws, $\theta^s$ from $g$ and, for each, compute the importance weight
$p(\theta^s|y)/g(\theta^s)$. (As usual, we only need these weights up to an arbitrary multiplicative
constant; thus it would be fine to use unnormalized densities in place of $g$ or $p$.) We could then
compute expectations using weighted averages.

Unfortunately, the direct use of importance weighting from a variational approximation
can be disastrous, because the variational Bayes fit tends to be less variable than the target
distribution, hence the distribution of importance ratios can have long tails, leading to
unstable averages. So instead we would recommend importance resampling, in which we
first draw from $g$, then resample without replacement using the importance weights. (It is
crucial to resample without replacement so that any sampled points with extremely high
weights do not dominate the simulations.) As in general with importance resampling, it is
not always clear how many draws to take at each of the two stages of sampling.
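
A minimal Python sketch of importance resampling without replacement follows. The setting is a toy assumption chosen to mimic the problem described above: the unnormalized target is a $t_4$ density and the approximation $g$ is a normal density that is deliberately too narrow. The function names and the sample sizes $S = 2000$ and $k = 200$ are illustrative choices, not recommendations.

```python
import numpy as np


def log_target_unnorm(theta):
    """Hypothetical unnormalized target: a t density with 4 degrees of freedom."""
    return -2.5 * np.log1p(theta**2 / 4)


def log_g(theta, scale=0.8):
    """Approximation g = N(0, scale^2), narrower than the target, up to a constant."""
    return -0.5 * (theta / scale) ** 2 - np.log(scale)


rng = np.random.default_rng(0)
S, k = 2000, 200

# Stage 1: draw from the approximation g.
theta = rng.normal(0.0, 0.8, size=S)

# Importance weights, needed only up to a multiplicative constant.
log_w = log_target_unnorm(theta) - log_g(theta)
w = np.exp(log_w - log_w.max())

# Stage 2: resample k of the S draws WITHOUT replacement, with probability
# proportional to weight, so that no single heavy draw can be picked repeatedly.
idx = rng.choice(S, size=k, replace=False, p=w / w.sum())
theta_resampled = theta[idx]
```

Because the weights grow in the tails where $g$ is too thin, the resampled draws are more spread out than the original draws from $g$, although they cannot extend beyond the range that the first-stage sample happened to cover.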

A more general approach would be particle filtering, again using draws from the varia- 
tional Bayes as a starting point and then moving through the target density using Metropolis 
or Hamiltonian Monte Carlo and splitting and removing points as appropriate. Implement- 
ing this for any particular example could represent a large investment in programming 
time, but for a large problem, or in a computing environment in which particle filtering has 
already been set up, it could make sense. 


EM as a special case of variational Bayes 


Variational inference proceeds in $J$ steps, each time updating one conditional distribution
$g_j$, averaging over the other factors of $g$. EM has two steps (the E-step and the M-step),
alternately estimating a parameter $\phi$ and averaging over the other parameters $\gamma$. EM can
be seen as a special case of variational Bayes in which (a) the parameters are partitioned
into two parts, $\phi$ and $\gamma$, (b) the approximating distribution for $\phi$ is required to be a point
mass (thus, updating $g(\phi)$ is equivalent to updating the point estimate of $\phi$), and (c) the
approximating distribution for $\gamma$ is unconstrained; thus $g(\gamma) = p(\gamma|\phi, y)$, conditional on the
most recent update of $\phi$.


More general forms of variational Bayes


Variational inference as described above is an approach for approximating a target distribu- 
tion p over some class of approximating distributions g using an iterative algorithm that at 
each step reduces the Kullback-Leibler divergence, KL(g||p). As noted above, this can be 
done in closed form only for models with certain conjugacy properties. Such models include 
many important special cases (such as normal distributions and finite mixtures), but more 
generally the idea of variational Bayes can be extended by replacing the objective function 
(the criterion to be minimized) with an approximation of KL(g||p). For many problems, 
including logistic regression, good approximations are available, so that an algorithm that 
optimizes over this new criterion yields a good approximation to the posterior distribution. 
Typically the approximation itself changes with each step, being defined based on the most 
recent update of the approximating function g. 


13.8 Expectation propagation 


Expectation propagation (EP) is another deterministic iterative algorithm in which the
posterior distribution $p(\theta|y)$ is approximated by a best-fit distribution from some specified
parametric family. Expectation propagation differs from variational Bayes in its
optimization criterion and also in the nature of how it is computed. We shall first describe the
algorithm in general and then go through the steps of applying it to logistic regression.

We start with the target distribution $p(\theta|y)$, which we shall write as $f(\theta)$, suppressing
the dependence on $y$ which is not directly relevant for these computations. We assume that
$f$ has some convenient factorization,

$$
f(\theta) = \prod_{i=0}^{n} f_i(\theta). \tag{13.29}
$$

As with many Bayesian computations, all we need for the $f_i$'s are the unnormalized density
functions.

Expectation propagation can be expressed more generally, but in this description it is
convenient to think of $f_0(\theta)$ as the prior density and each $f_i(\theta)$ as the likelihood for one
data point. The computational advantage of the factorization arises from the possibility of
computing certain expectations rapidly when the density is factorized in this way, as we
discuss below.

As with variational Bayes, expectation propagation works by iteratively approximating
the target distribution by some $g(\theta)$ which is constrained to follow some parametric form.
The algorithm begins with some guess for $g$ and then proceeds via an iterative updating.
A key difference between the two methods is that variational inference is typically based
on a separation of $g$ into factors for each parameter (thus, $g(\theta) = \prod_{j=1}^{J} g_j(\theta_j)$), whereas
expectation propagation factorizes $g$ based on a partition of the data; thus,

$$
g(\theta) = \prod_{i=0}^{n} g_i(\theta). \tag{13.30}
$$

At convergence, each factor of (13.30) is intended to approximate the corresponding factor
of (13.29). The trick is that these approximations are done one at a time, but in the context
of the entire distribution.


Exponential families, sufficient statistics, and natural parameters. The approximating
distribution $g(\theta)$ should be in the exponential family; as discussed on page 36, this means that
the density can be written as a normalizing function times the exponential of a linear
function of sufficient statistics of $\theta$. For example, take the normal distribution:

$$
\begin{aligned}
N(\theta|\mu,\sigma^2) &= \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{1}{2\sigma^2}(\theta - \mu)^2\right) \\
&= \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\mu^2}{2\sigma^2}\right)\exp\left(-\frac{1}{2\sigma^2}\theta^2 + \frac{\mu}{\sigma^2}\theta\right).
\end{aligned}
$$

In this parameterization, the normalizing function is the ugly-looking $\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{\mu^2}{2\sigma^2}\right)$, but
this does not really matter. What is important are the sufficient statistics, $\theta$ and $\theta^2$. In
expectation propagation, we compute the expectations of the sufficient statistics under
various combinations of the approximating distribution and the target distribution. The
coefficients of the sufficient statistics inside the exponential of the above expression are
called the natural parameters of the model.

In typical applications of expectation propagation, the approximating distribution is 
restricted to the multivariate normal family; thus, $g(\theta) = N(\theta|\mu, \Sigma)$. Here there are two 
sufficient statistics: the vector $\theta$ and the outer-product matrix $\theta\theta^T$, and the corresponding 
natural parameters are proportional to the scaled mean vector $\Sigma^{-1}\mu$ and the precision 
matrix $\Sigma^{-1}$. 
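
As a small illustration of why the natural parameterization is convenient, the following sketch converts a multivariate normal between its moment form (mean and covariance) and its natural form (scaled mean and precision), and multiplies two normal densities by adding natural parameters. The function names `to_natural` and `to_moment` and the numerical values are made up for this example.

```python
import numpy as np

def to_natural(mu, Sigma):
    """Return (Sigma^{-1} mu, Sigma^{-1}) for the normal N(mu, Sigma)."""
    Q = np.linalg.inv(Sigma)          # precision matrix
    return Q @ mu, Q

def to_moment(r, Q):
    """Invert to_natural: recover (mu, Sigma) from (r, Q)."""
    Sigma = np.linalg.inv(Q)
    return Sigma @ r, Sigma

# Two illustrative normal factors (values are arbitrary).
mu_a, Sigma_a = np.array([0.0, 1.0]), np.array([[2.0, 0.3], [0.3, 1.0]])
mu_b, Sigma_b = np.array([1.0, -1.0]), np.array([[1.0, 0.0], [0.0, 4.0]])

# The product of two normal densities is (up to a constant) normal,
# with natural parameters equal to the sum of the factors' parameters.
r_a, Q_a = to_natural(mu_a, Sigma_a)
r_b, Q_b = to_natural(mu_b, Sigma_b)
mu_prod, Sigma_prod = to_moment(r_a + r_b, Q_a + Q_b)
```

Dividing one normal factor out of another works the same way with subtraction, which is how the cavity distribution is formed in expectation propagation.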


The expectation propagation algorithm. At each step of the iterative algorithm, we take the 
current approximating function $g(\theta)$ and pull out the approximating factor $g_i(\theta)$, replacing it 
by the corresponding factor $f_i(\theta)$ from the target distribution. We define the (unnormalized) 

cavity distribution: $g_{-i}(\theta) \propto g(\theta)/g_i(\theta)$, 

and the 

tilted distribution: $g_{-i}(\theta) f_i(\theta)$. 

We then construct an approximation to the tilted distribution, using a moment-matching 
approach described below. This approximation is the updated $g(\theta)$. We then back out the 
updated approximating factor, $g_i(\theta) = g(\theta)/g_{-i}(\theta)$. The result is that we have a new $g_i(\theta)$ 
which approximates $f_i(\theta)$, in the context of $g_{-i}$. This also explains why the algorithm needs 
to iterate, as the context changes with each step until convergence. 


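Moment matching. The core of the expectation propagation algorithm occurs within each 
step, to construct the approximation of the tilted distribution, $g_{-i}(\theta) f_i(\theta)$, within the 
parametric form specified for $g(\theta)$. The way this is done is by matching moments: that is, 
setting the expectations of the sufficient statistics of $\theta$ under $g$ equal to the corresponding 
expectations under the tilted distribution, $g_{-i}(\theta) f_i(\theta)$. 

For example, if $g(\theta)$ has the form $N(\theta|\mu, \Sigma)$, then in the moment-matching step we set 
$\mu = E_{\rm tilted}(\theta) = \int \theta\, g_{-i}(\theta) f_i(\theta)\, d\theta$ and 
$\Sigma = {\rm var}_{\rm tilted}(\theta) = \int (\theta - \mu)(\theta - \mu)^T g_{-i}(\theta) f_i(\theta)\, d\theta$, 
where the tilted distribution is taken to be normalized so that it integrates to 1. 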

The difficult part of this step is computing these expectations, which in theory could 
require an integration over a high-dimensional space (that of the entire parameter vector 
$\theta$). In practical implementations of expectation propagation, these integrals can be done in 
closed form or via a transformation that reduces the problem to a low-dimensional integral. 
What makes this work is that the tilted distribution is mostly $g_{-i}$ (which is easy to handle 
because it follows a specified parametric form such as the multivariate normal) with only 
one difficult factor $f_i$. For many models, $f_i$ can be expressed in such a way that its integral 
over $g_{-i}$ is well behaved. 
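
To make the low-dimensional integral concrete, here is a minimal sketch of moment matching in one dimension, assuming a normal cavity distribution and a binomial-logit factor (the case treated later in this section). The function name `tilted_moments`, its arguments, and the example values are illustrative; the integration uses `scipy.integrate.quad`, an adaptive QUADPACK routine built on Gauss-Kronrod rules.

```python
import numpy as np
from scipy import integrate

def log_expit(x):
    """Numerically stable log(logit^{-1}(x))."""
    return -np.logaddexp(0.0, -x)

def tilted_moments(M_cav, V_cav, y, m, delta=10.0):
    """Mean and variance of the tilted density N(eta|M_cav, V_cav) * Bin(y|m, logit^{-1}(eta)).

    Constant factors (binomial coefficient, normal normalizer) are dropped
    because they cancel in the moment ratios.
    """
    sd = np.sqrt(V_cav)
    lo, hi = M_cav - delta * sd, M_cav + delta * sd   # integration bounds

    def integrand(eta, k):
        log_lik = y * log_expit(eta) + (m - y) * log_expit(-eta)
        log_cav = -0.5 * (eta - M_cav) ** 2 / V_cav
        return eta ** k * np.exp(log_lik + log_cav)

    # Moments 0, 1, 2 of the unnormalized tilted density.
    E = [integrate.quad(integrand, lo, hi, args=(k,))[0] for k in range(3)]
    M = E[1] / E[0]
    V = E[2] / E[0] - M ** 2
    return M, V

# Illustrative call: 3 successes out of 5 trials, cavity N(0, 4).
M, V = tilted_moments(M_cav=0.0, V_cav=4.0, y=3, m=5)
```

With no data in the factor (`m=0`), the tilted moments reduce to the cavity moments, which is a convenient sanity check.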

If g is updated after each moment-matching step, the algorithm is called sequential EP, 
whereas if g is updated only after all tilted moments have been computed the algorithm is 
called parallel EP. Parallel EP is typically much faster as it requires less frequent updates 
of the higher-dimensional function $g$. 

Moment matching corresponds to minimizing the Kullback-Leibler divergence from the 
tilted distribution to the new approximated marginal distribution, but the iterative 
matching of the marginals does not guarantee that the Kullback-Leibler divergence from the full 
posterior distribution to the overall approximation is minimized. There is no guarantee of 
convergence for EP, but for models with log-concave factors $f_i$ and initialization to the prior 
distribution, the algorithm has been used successfully in many applications. 
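
The link between moment matching and the Kullback-Leibler divergence can be seen in a short derivation. It assumes $g$ is written in exponential-family form with natural parameter vector $\lambda$, sufficient statistics $t(\theta)$, base measure $h(\theta)$, and log normalizer $A(\lambda)$; these symbols are introduced here only for illustration. Let $p$ denote the normalized tilted distribution.

```latex
\begin{align*}
g_\lambda(\theta) &= h(\theta)\exp\left(\lambda^T t(\theta) - A(\lambda)\right), \\
\mathrm{KL}(p \,\|\, g_\lambda)
  &= \int p(\theta)\log\frac{p(\theta)}{g_\lambda(\theta)}\,d\theta
   = \mathrm{const} - \lambda^T E_p[t(\theta)] + A(\lambda), \\
\nabla_\lambda \mathrm{KL}(p \,\|\, g_\lambda)
  &= -E_p[t(\theta)] + \nabla_\lambda A(\lambda)
   = -E_p[t(\theta)] + E_{g_\lambda}[t(\theta)].
\end{align*}
```

Setting the gradient to zero gives $E_{g_\lambda}[t(\theta)] = E_p[t(\theta)]$, and because $A(\lambda)$ is convex this stationary point is the minimum. For the normal family, $t(\theta)$ consists of $\theta$ and $\theta\theta^T$, so the condition is exactly the matching of mean and covariance.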


Expectation propagation for logistic regression 


Consider the model of independent data $y_i \sim \mathrm{Bin}(m_i, \mathrm{logit}^{-1}(X_i\theta))$, $i=1,\ldots,n$, with prior 
distribution $\theta \sim N(\mu_0, \Sigma_0)$. Here, $X_i$ is the $i$th row of the $n \times k$ matrix $X$ of predictors. 
It is not difficult to iteratively solve for the ($k$-dimensional) posterior mode of $\theta$ and then 
compute the second derivative matrix of the log posterior density, thus obtaining a 
mode-centered normal approximation (for details, see Section 16.2), but we can get a better 
normal approximation using expectation propagation, as follows. 

We use a normal approximating function with factors $g_i(\theta) = N(\theta|\mu_i, \Sigma_i)$, $i = 0,\ldots,n$. 
We set $g_0$ to equal the prior distribution and, to start, for each $i = 1,\ldots,n$, set the natural 
parameter $\Sigma_i^{-1}\mu_i$ to the zero vector and $\Sigma_i^{-1}$ to the identity matrix $I_k$ times some positive 
number, corresponding to a starting distribution that is precise enough to be computationally 
stable but not so sharply localized that the algorithm is slow to move from its initial 
value. 

The iteration proceeds by stepping through the data points. For each i: 


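1. Compute the parameters of the cavity distribution, $g_{-i}(\theta) = N(\theta|\mu_{-i}, \Sigma_{-i})$: 

$$\Sigma_{-i}^{-1}\mu_{-i} = \Sigma^{-1}\mu - \Sigma_i^{-1}\mu_i$$
$$\Sigma_{-i}^{-1} = \Sigma^{-1} - \Sigma_i^{-1}.$$

2. Project the cavity distribution onto the one-dimensional subspace represented by the 
data vector $X_i$. The projected distribution is a one-dimensional normal with mean and 
variance, 

$$M_{-i} = X_i\mu_{-i}$$
$$V_{-i} = X_i\Sigma_{-i}X_i^T.$$

Steps 1 and 2 can be combined so that only the scalar moments $X_i\Sigma X_i^T$ and $X_i\mu$ are required: 

$$M_{-i} = V_{-i}\left((X_i\Sigma X_i^T)^{-1}X_i\mu - V_i^{-1}M_i\right)$$
$$V_{-i} = \left((X_i\Sigma X_i^T)^{-1} - V_i^{-1}\right)^{-1}.$$

3. Define the (unnormalized) tilted distribution of $\eta = X_i\theta$: 

$$g_{-i}(\eta) f_i(\eta) = N(\eta|M_{-i}, V_{-i})\,\mathrm{Bin}(y_i|m_i, \mathrm{logit}^{-1}(\eta)).$$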


Compute moments 0, 1, 2 of this unnormalized distribution to get moments of the tilted 
distribution of $\eta$: 

$$E_k = \int \eta^k g_{-i}(\eta) f_i(\eta)\, d\eta, \quad \mathrm{for}\ k = 0,1,2.$$

Compute $M = E_1/E_0$ and $V = E_2/E_0 - (E_1/E_0)^2$, the mean and variance of the tilted distribution of 
$\eta$, using numerical integration. We use the iterative Gauss-Kronrod quadrature method. 
To perform these (one-dimensional) integrals, we need lower and upper bounds of integration. 
Ideally we would do this based on the mode and curvature of the tilted distribution 
but for simplicity we might just use $M_{-i} \pm \delta\sqrt{V_{-i}}$, based on the mean and standard 
deviation of the cavity distribution. The multiplier $\delta$ is set to some large number such as 10 
to ensure that the mass of the tilted distribution is contained in the range of integration. 

4. Subtract off the cavity distribution to get the moments of the updated approximating 
factor $g_i(\eta) = N(\eta|M_i, V_i)$: 

$$\frac{M_i}{V_i} = \frac{M}{V} - \frac{M_{-i}}{V_{-i}}$$
$$\frac{1}{V_i} = \frac{1}{V} - \frac{1}{V_{-i}}.$$


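5. Transform these to get the moments of the updated approximating factor defined on the 
full space, $g_i(\theta) = N(\theta|\mu_i, \Sigma_i)$: 

$$\Sigma_i^{-1}\mu_i = X_i^T\frac{M_i}{V_i}$$
$$\Sigma_i^{-1} = X_i^T\frac{1}{V_i}X_i.$$

Recall that $M_i$ and $V_i$ are scalars, $\Sigma_i^{-1}\mu_i$ is a $k \times 1$ vector, and $\Sigma_i^{-1}$ is a $k \times k$ matrix. 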


6. Combine this updated $g_i$ with the cavity distribution $g_{-i}$ to get the updated 
approximating distribution, $g(\theta) = N(\theta|\mu, \Sigma)$. This is done by adding the natural parameters of 
the component parts: 

$$\Sigma^{-1}\mu = \Sigma_{-i}^{-1}\mu_{-i} + \Sigma_i^{-1}\mu_i$$
$$\Sigma^{-1} = \Sigma_{-i}^{-1} + \Sigma_i^{-1}.$$

In parallel EP this step is skipped, and the updated posterior approximation is computed only 
after all the approximating factors $g_i(\theta)$ have been updated. With large $n$ and $k$, this saves 
computation time. 

7. Now return to step 1, updating a new i. 


What makes the algorithm computationally feasible is that, in each step, the relevant factor 
of the likelihood depends on the parameters only through the linear combination $X_i\theta$. It 
is a different linear combination at each step, but during any particular step, the required 
integrals are one-dimensional. 
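
The numbered steps above can be put together in a compact sequential EP loop. The following is a sketch under stated assumptions, not the implementation used for the figures in this section: the names `ep_logistic`, `tilted_moments`, and `init_prec` are hypothetical; the one-dimensional integrals are done on a fixed grid instead of by Gauss-Kronrod quadrature; and a proper normal prior is required so that $g_0$ has a finite precision matrix. The dose and death counts at the bottom are a transcription of the bioassay data referred to as Table 3.1 and should be checked against that table.

```python
import numpy as np

def log_expit(x):
    """Numerically stable log(logit^{-1}(x))."""
    return -np.logaddexp(0.0, -x)

def tilted_moments(M_cav, V_cav, y, m, delta=10.0, n_grid=2001):
    """Step 3: mean and variance of the 1-D tilted distribution, on a grid."""
    sd = np.sqrt(V_cav)
    eta = np.linspace(M_cav - delta * sd, M_cav + delta * sd, n_grid)
    log_w = (y * log_expit(eta) + (m - y) * log_expit(-eta)
             - 0.5 * (eta - M_cav) ** 2 / V_cav)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                      # normalized grid weights
    M = np.sum(w * eta)
    V = np.sum(w * (eta - M) ** 2)
    return M, V

def ep_logistic(X, y, m, mu0, Sigma0, n_sweeps=20, init_prec=1.0):
    """Sequential EP for y_i ~ Bin(m_i, logit^{-1}(X_i theta)), theta ~ N(mu0, Sigma0)."""
    n, k = X.shape
    Q0 = np.linalg.inv(Sigma0)
    r0 = Q0 @ mu0
    # Site natural parameters: r_site[i] = Sigma_i^{-1} mu_i, Q_site[i] = Sigma_i^{-1}.
    r_site = np.zeros((n, k))
    Q_site = np.tile(init_prec * np.eye(k), (n, 1, 1))
    r = r0 + r_site.sum(axis=0)
    Q = Q0 + Q_site.sum(axis=0)
    for _ in range(n_sweeps):
        for i in range(n):
            x = X[i]
            # Step 1: cavity distribution in natural parameters.
            Q_cav = Q - Q_site[i]
            r_cav = r - r_site[i]
            Sigma_cav = np.linalg.inv(Q_cav)
            mu_cav = Sigma_cav @ r_cav
            # Step 2: project onto eta = X_i theta.
            M_cav = x @ mu_cav
            V_cav = x @ Sigma_cav @ x
            # Step 3: moments of the tilted distribution of eta.
            M, V = tilted_moments(M_cav, V_cav, y[i], m[i])
            # Step 4: subtract the cavity to get the scalar site parameters.
            prec_i = 1.0 / V - 1.0 / V_cav     # 1 / V_i
            lin_i = M / V - M_cav / V_cav      # M_i / V_i
            # Step 5: lift the site back to the full parameter space.
            Q_site[i] = prec_i * np.outer(x, x)
            r_site[i] = lin_i * x
            # Step 6: recombine site and cavity.
            Q = Q_cav + Q_site[i]
            r = r_cav + r_site[i]
    Sigma = np.linalg.inv(Q)
    return Sigma @ r, Sigma

# Bioassay-style data (assumed transcription of Table 3.1); weak N(0, 10^2 I) prior
# is an assumption made here so that the prior factor is proper.
x_dose = np.array([-0.86, -0.30, -0.05, 0.73])
X = np.column_stack([np.ones(4), x_dose])
y = np.array([0, 1, 3, 5])
m = np.array([5, 5, 5, 5])
mu, Sigma = ep_logistic(X, y, m, np.zeros(2), 100.0 * np.eye(2))
```

The full-matrix inverse in step 1 is used for readability; the combined scalar form of steps 1 and 2 avoids it when $k$ is large.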

In addition, the algorithm operates just as easily for any fixed normal prior distribution 
on $\theta$, as this just folds into the factor $g_0$. For a hierarchical model that 
contains additional hyperparameters, another step is needed. 

We illustrate expectation propagation for a simple logistic regression with uniform prior 
distribution. 


Example. Bioassay logistic regression with two coefficients 

Section 3.7 describes an experiment on 20 rats in four groups of 5, each group exposed 
to a different level of a toxin. For consistency with the notation immediately above, 
we write the model as $y_i \sim \mathrm{Bin}(m_i, \mathrm{logit}^{-1}(\theta_1 + \theta_2 x_i))$, $i = 1,\ldots,4$. The row vector 
of data for observation $i$ is then $X_i = (1, x_i)$. The model is completed with a flat prior 
distribution on $\theta = (\theta_1, \theta_2)$. As usual in such settings, this uniform prior distribution 
is not a reasonable summary of any scientific understanding of the problem but 
rather serves as a placeholder, with the understanding that it can be augmented with 
substantive information as needed. 

The data are in Table 3.1 on page 74. In Section 4.1 we fit the basic mode-based 
normal approximation, yielding a point estimate of (0.8, 7.7) and a covariance matrix 
as shown in Figure 4.1 on page 86. For comparison, Figure 3.3 on page 76 displays 
the contours of the exact posterior density. The actual density is skewed, with a long 
tail toward large values of $\theta_1$ and $\theta_2$, and so we would hope that our approximation 
using expectation propagation would move in that direction, compared to the mode, 
to better fit the full distribution. 

Figure 13.7 Progress of expectation propagation for a simple logistic regression with intercept and 
slope parameters. The bivariate normal approximating distribution is characterized by a mean 
and standard deviation in each dimension and a correlation. The algorithm reached approximate 
convergence after 4 iterations. 

Figure 13.8 (a) Progress of the normal approximating distribution during the iterations of 
expectation propagation. The small ellipse at the bottom (which is actually a circle if x and y axes 
are placed on a common scale) is the starting distribution; after a few iterations the algorithm 
converges. (b) Comparison of the approximating distribution from EP (solid ellipse) to the simple 
approximation based on the curvature at the posterior mode (dotted ellipse) and the exact 
posterior density (dashed oval). The exact distribution is not normal so the EP approximation is not 
perfect, but it is closer than the mode-based approximation. All curves show contour lines for the 
density at 0.05 times the mode (which for the normal distribution contains approximately 95% of 
the probability mass; see discussion on page 85). 

We then run the algorithm, with starting values $\Sigma_i^{-1}\mu_i = 0$ and $\Sigma_i^{-1} = I$ for 
$i = 1,\ldots,4$. During the progress of the iterations we keep track of the $2 \times 1$ vector $\Sigma^{-1}\mu$ 
and the $2 \times 2$ matrix $\Sigma^{-1}$; these are the natural parameters of the normal approximating 
distribution. To understand these better we reparameterize as $\mu_1, \mu_2, \sigma_1, \sigma_2, \rho$. 
Figure 13.7 shows the progress of these parameters over 10 iterations of the algorithm (a 
total of 40 steps), which is more than enough in this case for practical convergence. 
Figure 13.8 compares the final approximating distribution from expectation propa- 
gation to the simpler normal approximation based on the curvature at the posterior 
mode, and to the exact posterior density. In this example, EP performs well. The 
approximation is shifted toward the mass of the distribution, as we would hope. 


In summary, expectation propagation is an appealing algorithm. It is fast and direct to 
implement. Unlike EM, it approximates the entire distribution rather than just supplying 
a point estimate; and, unlike usual implementations of variational Bayes, it fits the joint 
distribution rather than just the margins. Expectation propagation can be difficult to apply 
in general settings, however, as it requires a likelihood or prior factorization in which the 
required integrals can be expressed in some simple form. An active goal of research for 
all these deterministic approximate methods is to develop general implementations that 
can work with arbitrary density functions, as can now be done stochastically using Gibbs, 
Metropolis, and Hamiltonian Monte Carlo. 


Extensions of expectation propagation 


There is a provably convergent slower double-loop algorithm for EP, which can be combined 
with regular EP so that the slower algorithm is only used if regular EP does not converge. 

Sometimes convergence of EP can be improved by using damping, that is, by making only 
partial updates of g after moment matching. Fractional EP (or power EP) is an extension 
of EP which can be used to improve stability when the approximation g is not flexible 
enough or when the propagation of information is difficult due to vague prior information. 
Fractional updating can be viewed as minimization of the $\alpha$-divergence, which includes the 
directed Kullback-Leibler divergences and the Hellinger distance as special cases. Fractional 
EP provides flexibility in the choice of minimized divergence and can also be used to improve 
convergence and to recover standard EP by setting $\alpha$ close to 1 in the final iterations. 
Improved marginal posteriors for $\theta_i$ can be obtained by applying expression (13.9) or faster 
approximations of that. 
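
Damping is simple to express in the natural parameterization. The sketch below shows one common form, a convex combination of a site's old natural parameters and the newly moment-matched ones; the function name `damped_site_update` and the `damping` argument are illustrative choices, not notation from the text.

```python
import numpy as np

def damped_site_update(r_old, Q_old, r_new, Q_new, damping=0.5):
    """Move a site's natural parameters only part of the way to the new values.

    damping=1 gives the usual full EP update; damping=0 leaves the site unchanged.
    """
    r = (1.0 - damping) * r_old + damping * r_new
    Q = (1.0 - damping) * Q_old + damping * Q_new
    return r, Q

# Illustrative values for a two-parameter site.
r_old, Q_old = np.zeros(2), np.eye(2)
r_new, Q_new = np.array([1.0, 2.0]), 3.0 * np.eye(2)
r_half, Q_half = damped_site_update(r_old, Q_old, r_new, Q_new, damping=0.5)
```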


13.9 Other approximations 
Integrated nested Laplace approximation (INLA) 


Another form of posterior approximation involves partitioning the parameters into a large 
set $\gamma$ conditional on a smaller set of hyperparameters $\phi$. The idea is to construct a joint 
Gaussian approximation for $p(\gamma|\phi,y)$ and apply expression (13.9) to approximate both 
$p(\phi|y)$ and $p(\gamma_i|\phi,y)$. Approximations to $p(\gamma_i|y)$ are obtained by numerically integrating 
over the low-dimensional $p_{\rm approx}(\phi|y)$ (hence the name integrated nested Laplace 
approximation). INLA works best when there are not many hyperparameters in the model, because 
then the space of hyperparameters is small enough that their marginal posterior 
distribution can be reasonably approximated by a sample on some discrete grid. The algorithm 
was developed for hierarchical models in which the parameters for the data model have a 
joint normal prior, so that the conditional normal approximation is easily constructed. 


Central composite design integration (CCD) 


If we would like to improve on the modal approximation but the computation of $p(\phi|y)$ is costly, 
we want to minimize the number of evaluation points around the mode. Clever deterministic 
placement of points can provide lower variance than sampling-based approaches using the same 
number of posterior evaluations. Central composite design is a useful method 
for obtaining a moderate number of representative points from posteriors having moderate 
dimensionality. For example, a 5-dimensional model uses 27 integration points under 
this method, while a 15-dimensional model uses 287 points. CCD uses a fractional factorial 
design to avoid the exponential increase in the number of evaluation points as the 
dimensionality of the posterior increases, while still allowing the curvature of the posterior 
distribution around the mode to be estimated. The integration is a finite sum with special weights. The 
accuracy of CCD is between that of the modal approximation and that of full integration with a grid or Monte 
Carlo. We use CCD in Chapter 21 for Gaussian processes. 
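
The two point counts quoted above are consistent with the usual central composite layout, which is assumed here and not spelled out in the text: one center point, $2d$ axial ("star") points, and $2^{d-p}$ corner points from a fractional factorial design, where $p$ is the number of factors aliased away.

```latex
\begin{align*}
N_{\rm CCD}(d) &= 1 + 2d + 2^{d-p}, \\
d = 5:\quad & 1 + 10 + 2^{4} = 27, \\
d = 15:\quad & 1 + 30 + 2^{8} = 287.
\end{align*}
```

For comparison, a full grid with three levels per dimension would need $3^5 = 243$ and $3^{15} = 14{,}348{,}907$ points in these two cases.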


Approximate Bayesian computation (ABC) 


The term ‘approximate Bayesian computation’ is applied to a set of statistical procedures 
based on drawing parameters $\theta$ from an initial or approximate distribution, then sampling 
replicated data $y^{\rm rep}|\theta$ from the model, and then accepting or rejecting the sample based 
on the closeness of $y^{\rm rep}$ to the observed data $y$. The attraction of approximate Bayesian 
computation is that it does not require computation of the likelihood function, only the 
ability to simulate $y^{\rm rep}|\theta$ from the data distribution; the difficulty is in the assessment of 
the closeness of $y^{\rm rep}$ to $y$. 

The most basic form of ABC has the form of simple rejection sampling: 

• Draw $\theta$ from the prior distribution $p(\theta)$ and then $y^{\rm rep}$ from the data distribution, $p(y^{\rm rep}|\theta)$, 
thus obtaining a single draw of $y^{\rm rep}$ from its marginal distribution. 

• Compute a discrepancy measure $d(y^{\rm rep}, y)$, where $d$ is defined so that it is zero if $y$ and 
$y^{\rm rep}$ are identical and is larger the more ‘different’ they are, in some relevant dimensions. 

• Accept $\theta$ if $d(y^{\rm rep}, y) < \epsilon$ for some preset threshold $\epsilon$, otherwise reject. 


The result is to accept draws from the prior distribution in proportion to the probability 
that they yield replicated data that are close to the observed data. This latter probability 
is approximately the likelihood, hence the accepted set of simulation draws is an approxi- 
mation of the posterior distribution. 
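
The rejection scheme above can be sketched in a few lines. The example assumes a toy model that is not from the text: $\theta \sim \mathrm{Uniform}(0,1)$ and $y \sim \mathrm{Bin}(20, \theta)$ with observed $y = 7$, and the discrepancy $d = |y^{\rm rep} - y|$. The function name `abc_rejection` and all numerical settings are illustrative. With integer data and $\epsilon = 1$, only exact matches are accepted, so the accepted draws should follow the exact $\mathrm{Beta}(8, 14)$ posterior.

```python
import numpy as np

def abc_rejection(y_obs, n_trials, n_draws, eps, rng):
    """Basic ABC rejection sampler for a binomial model with a uniform prior."""
    theta = rng.uniform(0.0, 1.0, size=n_draws)   # draw theta from the prior
    y_rep = rng.binomial(n_trials, theta)         # simulate replicated data
    d = np.abs(y_rep - y_obs)                     # discrepancy measure
    return theta[d < eps]                         # keep draws with d < eps

rng = np.random.default_rng(0)
draws = abc_rejection(y_obs=7, n_trials=20, n_draws=100_000, eps=1, rng=rng)
posterior_mean = draws.mean()   # compare with the exact value 8/22
```

Raising `eps` increases the acceptance rate at the cost of a posterior approximation that is pulled toward the prior.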

ABC involves three challenges. First, one needs to define a discrepancy measure $d$, 
which ideally should capture the aspects of the data that are relevant for estimating the 
parameters in the model (that is, the sufficient statistics) without requiring $y^{\rm rep}$ to match 
$y$ on irrelevant ‘noise’ dimensions. Second, $\epsilon$ needs to be set small enough that the data 
provide information, but not so small that all (or almost all) the simulations get rejected. 
Third, if the prior distribution is broad enough, the rejection rate can be unacceptably high 
even if the discrepancy measure and threshold have been chosen well. 

These challenges can be partly addressed by combining ABC with other ideas of posterior 
simulation. For example, it is not necessary to draw from the prior; one can draw simulations 
from another distribution and then correct using importance sampling. Or one can use 
MCMC steps to move in interesting regions of parameter space. As with many ideas in 
Bayesian simulation, research is stimulated by the practical challenges of approximating 
certain distributions that arise in practice. 

A related idea is substitution likelihood, in which one uses a rank likelihood or a like- 
lihood that only depends on quantiles and not what happens in between them, in place of 
a full likelihood specification. These almost-likelihoods are put in place of the likelihood in 
Bayes rule. The advantage of this approach is that it allows a specified joint distribution 
model (which is sometimes called a copula) to be applied in settings where the marginal 
distribution would not fit. This is thus a computational approximation that allows a popular 
class of statistical models to be applied more broadly. 


13.10 Unknown normalizing factors 


Finally, we discuss the application of numerical integration to compute normalizing factors, 
a problem that arises in some complicated models that we largely do not discuss in this 
book. We include this section here to introduce the problem, which is an active area of 
research; see the bibliographic note for some references on the topic. 

Most of the models we present are based on combining standard classes of models for 
which the normalizing constants are known; for example, all the distributions in Appendix 
A have exactly known densities. Even the nonconjugate models we usually use are combi- 
nations of standard parts. 

For standard models, we can compute $p(\theta)$ and $p(y|\theta)$ exactly, or up to unknown 
multiplicative constants, and the expression 

$$p(\theta|y) \propto p(\theta)p(y|\theta)$$

has a single unknown normalizing constant: the denominator of Bayes’ rule, $p(y)$. A similar 
result holds for a hierarchical model with data $y$, local parameters $\gamma$, and hyperparameters 
$\phi$. The joint posterior density has the form 

$$p(\gamma, \phi|y) \propto p(\phi)p(\gamma|\phi)p(y|\gamma, \phi),$$


which, once again, has only a single unknown normalizing constant. In each of these situa- 
tions we can apply standard computational methods using the unnormalized density. 


Unknown normalizing factors in the likelihood. A new and different problem arises when 
the sampling density $p(y|\theta)$ has an unknown normalizing factor that depends on $\theta$. Such 
models often arise in problems that are specified conditionally, such as in spatial statistics. 
For a simple example, pretend we knew that the univariate normal density was of the 
form $p(y|\mu, \sigma) \propto \exp(-\frac{1}{2\sigma^2}(y - \mu)^2)$, but with the normalizing factor $1/(\sqrt{2\pi}\sigma)$ unknown. 
Performing our analysis as before without accounting for the factor of $1/\sigma$ would lead to 
an incorrect posterior distribution. (See Exercise 10.11 for a simple nontrivial example of 
an unnormalized density.) 
In general we use the following notation: 

$$p(y|\theta) = \frac{1}{z(\theta)}q(y|\theta),$$

where $q$ is a generic notation for an unnormalized density, and 

$$z(\theta) = \int q(y|\theta)\,dy \qquad (13.31)$$

is called the normalizing factor of the family of distributions (being a function of $\theta$, we can 
no longer call it a ‘constant’) and $q(y|\theta)$ is a family of unnormalized densities. We consider 
the situation in which $q(y|\theta)$ can be easily computed but $z(\theta)$ is unknown. Combining the 
density $p(y|\theta)$ with a prior density, $p(\theta)$, yields the posterior density 

$$p(\theta|y) \propto p(\theta)\frac{1}{z(\theta)}q(y|\theta).$$

To perform posterior inference, one must determine $p(\theta|y)$, as a function of $\theta$, up to an 
arbitrary multiplicative constant. 
An unknown, but constant, normalizing factor in the prior density, $p(\theta)$, causes no 
problems because it does not depend on any model parameters. 

Unknown normalizing factors in hierarchical models. An analogous situation arises in 
hierarchical models if the population distribution has an unknown normalizing factor that 
depends on the hyperparameters. Consider a model with data $y$, first-level parameters 
$\gamma$, and hyperparameters $\phi$. For simplicity, assume that the likelihood, $p(y|\gamma)$, is known 
exactly, but the population distribution is only known up to an unnormalized density, 
$q(\gamma|\phi) = z(\phi)p(\gamma|\phi)$. The joint posterior density is then 

$$p(\gamma, \phi|y) \propto p(\phi)\frac{1}{z(\phi)}q(\gamma|\phi)p(y|\gamma),$$

and the function $z(\phi)$ must be considered. If the likelihood, $p(y|\gamma)$, also has an unknown 
normalizing factor, it too must be considered in order to work with the posterior distribution. 


Posterior computations involving an unknown normalizing factor 


A basic computational strategy. If the integral (13.31), or the analogous expression for the 
hierarchical model, cannot be evaluated analytically, numerical integration can be used, 
perhaps involving more advanced approaches such as bridge and path sampling, discussed 
below. An additional difficulty is that one must evaluate (or estimate) the integral as a 
function of $\theta$, or $\phi$ in the hierarchical case. The following basic strategy, combining analytic 
and simulation-based integration methods, can be used for computation with a posterior 
distribution containing unknown normalizing factors. 


1. Obtain an analytic estimate of $z(\theta)$ using some approximate method, for example Laplace’s 
method centered at a crude estimate of $\theta$. 

2. Construct an approximation to the posterior distribution, as discussed in Chapter 13. 
Such approximations can often be integrated directly. 

3. For more exact computation, evaluate $z(\theta)$ (see below) whenever the posterior density 
needs to be computed for a new value of $\theta$. Computationally, this approach treats $z(\theta)$ 
as an approximately ‘known’ function that happens to be expensive to compute. 


Other strategies are possible in specific problems. If $\theta$ (or $\phi$ in the hierarchical version 
of the problem) is only one- or two-dimensional, it may be reasonable to compute $z(\theta)$ 
over a finite grid and interpolate to obtain an estimate of $z(\theta)$ as a function of $\theta$. It is 
still recommended to perform the approximate steps 1 and 2 above so as to get a rough 
idea of the location of the posterior distribution; for any given problem, $z(\theta)$ need not be 
computed in regions of $\theta$ for which the posterior probability is essentially zero. 


Computing the normalizing factor. The normalizing factor can be computed, for each value 
of $\theta$, using any of the numerical integration approaches applied to (13.31). Applying 
approximation methods such as Laplace’s is fairly straightforward, with the notation changed so 
that integration is over $y$, rather than $\theta$, or changed appropriately to evaluate normalizing 
constants as a function of hyperparameters in a hierarchical model. 

The importance sampling estimate is based on the identity 

$$z(\theta) = \int \frac{q(y|\theta)}{g(y)}\,g(y)\,dy = E_g\left(\frac{q(y|\theta)}{g(y)}\right),$$

where $E_g$ averages over $y$ under the approximate density $g(y)$. The estimate of $z(\theta)$ is 
$\frac{1}{S}\sum_{s=1}^{S} q(y^s|\theta)/g(y^s)$, based on simulations $y^1,\ldots,y^S$ from $g(y)$. Again, estimation of a 
normalizing factor for a hierarchical model is analogous. 

Some additional subtleties arise, however, when applying this method to evaluate $z(\theta)$ 
for many values of $\theta$. First, we can use the same approximation function, $g(y)$, and in fact 
the same simulations, $y^1,\ldots,y^S$, to estimate $z(\theta)$ for different values of $\theta$. Compared to 
performing a new simulation for each value of $\theta$, using the same simulations saves computing 
time and increases accuracy (with the overall savings in time, we can simulate a larger 
number $S$ of draws), but in general this can only be done in a local range of $\theta$ where the 
densities $q(y|\theta)$ are similar enough to each other that they can be approximated by the 
same density. Second, we have some freedom in our computations because the evaluation 
of $z(\theta)$ as a function of $\theta$ is required only up to a proportionality constant. Any arbitrary 
constant that does not depend on $\theta$ becomes part of the constant in the posterior density 
and does not affect posterior inference. Thus, the approximate density, $g(y)$, is not required 
to be normalized, as long as we use the same function $g(y)$ to approximate $q(y|\theta)$ for all 
values of $\theta$, or if we know, or can estimate, the relative normalizing constants of the different 
approximation functions used in the problem. 
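
As a check on the importance sampling estimate, the sketch below returns to the pretend-unknown normal example, where $q(y|\mu,\sigma) = \exp(-\frac{1}{2\sigma^2}(y-\mu)^2)$ and the exact answer is $z = \sqrt{2\pi}\,\sigma$. The approximating density $g$ is taken to be $N(0, 3^2)$, an arbitrary choice made here to be wider than every $q$ considered, and the same simulations are reused for several values of $\sigma$. Function names and settings are illustrative.

```python
import numpy as np

def q_unnorm(y, mu, sigma):
    """Unnormalized normal density q(y | mu, sigma)."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2)

def g_density(y, scale=3.0):
    """Normalized approximating density g(y) = N(y | 0, scale^2)."""
    return np.exp(-0.5 * (y / scale) ** 2) / (np.sqrt(2.0 * np.pi) * scale)

rng = np.random.default_rng(1)
S = 100_000
y_sim = rng.normal(0.0, 3.0, size=S)   # one set of draws from g, reused below

def z_hat(mu, sigma):
    """Importance sampling estimate of z(theta) = integral of q(y | theta) dy."""
    return np.mean(q_unnorm(y_sim, mu, sigma) / g_density(y_sim))

sigmas = [0.5, 1.0, 2.0]
estimates = [z_hat(0.0, s) for s in sigmas]
exact = [np.sqrt(2.0 * np.pi) * s for s in sigmas]
```

The estimate is least precise for the smallest $\sigma$, where $q$ is most unlike $g$; this is the local-range caveat noted above.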


Bridge and path sampling 


When computing integrals numerically, we typically want to evaluate several of them (for ex- 
ample, when computing the marginal posterior densities of different models) or to compute 
them for a range of values of a continuous parameter (as with continuous model expansion 
or when working with models whose normalizing factors depend on the parameters in the 
model and cannot be determined analytically). 

In these settings with a family of normalizing factors to be computed, importance 
sampling can be generalized in a number of useful ways. Continuing our notation above, we let 
$\phi$ be the continuous or discrete parameter indexing the family of densities $p(\gamma|\phi,y)$. The 
numerical integration problem is to average over $\gamma$ in this distribution, for each $\phi$ (or for a 
continuous range of values $\phi$). In general, for these methods it is only necessary to compute 
the densities $p$ up to arbitrary normalizing constants. 

One approach is to perform importance sampling using the density at some central 
value, $p(\gamma|\phi_*, y)$, as the approximating distribution for the entire range of $\phi$. This approach 
is convenient as it does not require the creation of a special $p_{\rm approx}$ but rather uses a 
distribution from a family that we already know how to handle (probably using Markov 
chain simulation). 

If the distributions $p(\gamma|\phi, y)$ are far enough apart that no single $\phi_*$ can effectively cover 
all of them, we can move to bridge sampling, in which $\gamma$ is sampled from two distributions, 
$p(\gamma|\phi_0, y)$ and $p(\gamma|\phi_1, y)$. Here, $\phi_0$ and $\phi_1$ represent two points near the ends of the space of 
$\phi$ (think of the family of distributions as a suspension bridge held up at two points). The 
bridge sampling estimate of the integral for any $\phi$ is a weighted average of the importance 
sampling estimates given $\phi_0$ and $\phi_1$. The weights depend on $\phi$ and can be computed using 
a simple iterative formula. 
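
One standard version of such an iterative formula is the estimator of Meng and Wong (1996) for the ratio $r = z_1/z_0$ of two normalizing constants, sketched below. This is offered as an illustration of the idea and is not necessarily the exact form intended above. The two unnormalized densities are toy choices made for this example (so that the true ratio is known to be 0.5), and all variable names are illustrative.

```python
import numpy as np

def q0(y):
    """Unnormalized N(0, 1) density; z0 = sqrt(2*pi)."""
    return np.exp(-0.5 * y ** 2)

def q1(y):
    """Unnormalized N(1, 0.5^2) density; z1 = 0.5 * sqrt(2*pi)."""
    return np.exp(-0.5 * ((y - 1.0) / 0.5) ** 2)

rng = np.random.default_rng(2)
n0 = n1 = 20_000
y0 = rng.normal(0.0, 1.0, n0)     # draws from the distribution with density q0/z0
y1 = rng.normal(1.0, 0.5, n1)     # draws from the distribution with density q1/z1

l0 = q1(y0) / q0(y0)              # density ratios at the draws from q0
l1 = q1(y1) / q0(y1)              # density ratios at the draws from q1
s0, s1 = n0 / (n0 + n1), n1 / (n0 + n1)

r = 1.0                           # starting guess for z1/z0
for _ in range(50):
    # Each pass reweights both samples using the current estimate of r.
    num = np.mean(l0 / (s1 * l0 + s0 * r))
    den = np.mean(1.0 / (s1 * l1 + s0 * r))
    r = num / den
```

Because both samples contribute, the estimate stays stable even when neither distribution alone would be a good importance sampling density for the other.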

Bridge sampling is a general idea that arises in many statistical contexts and can be 
further generalized to allow sampling from more than two points, which makes sense if the 
distributions vary widely over $\phi$. In the limit in which a sample is drawn from the entire 
continuous range of distributions $p(\gamma|\phi, y)$ indexed by $\phi$, we can apply path sampling, a 
differential form of bridge sampling. In path sampling, a sample $(\gamma, \phi)$ is drawn from a joint 
posterior distribution, and the derivative of the log posterior density, $d\log p(\gamma, \phi|y)/d\phi$, is 
computed at the simulated values and numerically integrated over $\phi$ to obtain an estimate of 
the log marginal density, $\log p(\phi|y)$, over a continuous range of values of $\phi$. This 
simulation-based computation uses the identity, 

$$\frac{d}{d\phi}\log p(\phi|y) = E(U(\gamma, \phi, y)|\phi, y),$$

where $U(\gamma, \phi, y) = d\log p(\gamma, \phi|y)/d\phi$. Numerically integrating these values gives an estimate 
of $\log p(\phi|y)$ (up to an additive constant) as a function of $\phi$. 

Bridge and path sampling are related to parallel tempering (see page 299), which uses 
a similar structure of samples from an indexed family of distributions. Depending on the 
application, the marginal distribution of $\phi$ can be specified for computational efficiency 
or convenience (as with tempering) or estimated (as with the computations of marginal 
densities). 
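
The path sampling identity can be tried on a toy family where the answer is known. The sketch assumes $q(y|\phi) = \exp(-\phi y^2/2)$, which plays the role of the joint density, so that $z(\phi) = \sqrt{2\pi/\phi}$ plays the role of the marginal and $U = d\log q/d\phi = -y^2/2$. The grid of $\phi$ values, the sample size, and the variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
phi_grid = np.linspace(1.0, 4.0, 31)   # path from phi = 1 to phi = 4
S = 5_000                              # draws per grid point

# At each phi, simulate y ~ N(0, 1/phi) and average U(y, phi) = -y^2 / 2.
u_mean = np.empty_like(phi_grid)
for j, phi in enumerate(phi_grid):
    y = rng.normal(0.0, 1.0 / np.sqrt(phi), size=S)
    u_mean[j] = np.mean(-0.5 * y ** 2)

# Trapezoidal integration of E(U | phi) along the path gives the change in log z.
log_ratio = np.sum(0.5 * (u_mean[1:] + u_mean[:-1]) * np.diff(phi_grid))
exact = -0.5 * np.log(4.0)             # log z(4) - log z(1)
```

Only differences in the log normalizing factor are recovered, matching the "up to an additive constant" qualification above.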


13.11 Bibliographic note 


An accessible source of general algorithms for conditional maximization (stepwise ascent), 
Newton’s method, and other computational methods is Press et al. (1986). Gill, Murray, 
and Wright (1981) is a classic book that is useful for understanding more complicated 
optimization problems. 

The boundary-avoiding prior densities in Section 13.2 are discussed by Chung, Rabe- 
Hesketh, et al. (2013a,b). 

Laplace’s method for integration was developed in a statistical context by Tierney and 
Kadane (1986), who demonstrated the accuracy of applying the method separately to the 
numerator and denominator of (13.3). Extensions and refinements were made by Kass, 
Tierney, and Kadane (1989) and Wong and Li (1992). Geweke (1989) discusses modal 
approximations for importance sampling and proposes the k-variate split normal density as 
an improved approximation for asymmetric posterior densities. 

The EM algorithm was first presented in full generality and under that name, along with 
many examples, by Dempster, Laird, and Rubin (1977); the formulation in that article is 
in terms of finding the maximum likelihood estimate, but, as the authors note, the same 
arguments hold for finding posterior modes. That article and the accompanying discussion 
contributions also refer to many earlier implementations in specific problems; see also Meng 
and Pedlow (1992). EM was first presented in a general statistical context by Orchard and 
Woodbury (1972) as the ‘missing information principle’ and first derived in mathematical 
generality by Baum et al. (1970). Little and Rubin (2002, Chapter 8) discuss the EM 
algorithm for missing data problems. SEM was introduced in Meng and Rubin (1991); ECM 
in Meng and Rubin (1993) and Meng (1994a); SECM in van Dyk, Meng, and Rubin (1995); 
and ECME in Liu and Rubin (1994). AECM appears in Meng and van Dyk (1997), and 
the accompanying discussion provides further connections. Many of the iterative simulation 
methods discussed in Chapter 11 for simulating posterior distributions can be regarded as 
stochastic extensions of EM; Tanner and Wong (1987) is an important paper in drawing 
this connection. Parameter-expanded EM was introduced by Liu, Rubin, and Wu (1998), 
and related ideas appear in Meng and van Dyk (1997), Liu and Wu (1999), and Liu (2003). 

Some references on variational Bayes include Jordan et al. (1999), Jaakkola and Jordan 
(2000), Blei, Ng, and Jordan (2003), and Gershman, Hoffman, and Blei (2013). Hoffman et 
al. (2013) present a stochastic variational algorithm that is computable for large datasets. 

Expectation propagation comes from Minka (2001). This and other deterministic ap- 
proximate Bayesian methods are reviewed by Bishop (2006) and Rasmussen and Williams 
(2006). Cseke and Heskes (2011) consider several methods to improve marginal posteri- 
ors obtained from Laplace’s method or expectation propagation. Jylanki, Nummenmaa, 
and Vehtari present an expectation propagation framework that can be used for hierarchical 
generalized linear models. Rue, Martino, and Chopin (2009) describe integrated nested 
Laplace approximation and CCD integration, with more information at Rue (2013). Heskes 
et al. and Marin et al. (2012) review approximate Bayesian computation. Some work on 
substitution likelihoods appears in Dunson and Taylor (2005), Hoff (2007), and Murray et 
al. (2013). 

Bridge sampling was introduced by Meng and Wong (1996). Gelman and Meng (1998) 
generalize from bridge sampling to path sampling and provide references to related work 
that has appeared in the statistical physics literature. Meng and Schilling (1996) provide an 
example in which several of these methods are applied to a problem in factor analysis. Kong 
et al. (2003) set up a general theoretical framework that includes importance sampling and 
bridge sampling as special cases. 

The method of computing normalizing constants for statistical problems using impor- 
tance sampling has been applied by Ott (1979) and others. Models with unknown normal- 
izing functions arise often in spatial statistics; see, for example, Besag (1974) and Ripley 
(1981, 1988). Geyer (1991) and Geyer and Thompson (1992, 1993) develop the idea of 
estimating the normalizing function using simulations from the model and have applied 
these methods to problems in genetics. Pettitt, Friel, and Reeves (2003) use path sampling 
to estimate normalizing constants for a class of models in spatial statistics. Computing 
normalizing functions is an area of active current research, as more and more complicated 
Bayesian models are coming into use. 


13.12 Exercises 


1. Multimodality: Consider a simple one-parameter model of independent data, $y_i \sim \text{Cauchy}(\theta, 1)$, 
$i = 1, \ldots, n$, with uniform prior density on $\theta$. Suppose $n = 2$. 
(a) Prove that the posterior distribution is proper. 
(b) Under what conditions will the posterior density be unimodal? 
2. Normal approximation and importance resampling: 


(a) Repeat Exercise 3.12 using the normal approximation to produce posterior simulations 
for $(\alpha, \beta)$. 

(b) Use importance resampling to improve on the normal approximation. 

(c) Compute the importance ratios for your simulations. Plot a histogram of the impor- 
tance ratios and comment on their distribution. Compute an estimate of effective 
sample size using (10.4) on page 266. 

3. Mode-based approximation: Consider the model, $y_j \sim \text{Binomial}(n_j, \theta_j)$, where $\theta_j = 
\text{logit}^{-1}(\alpha + \beta x_j)$, for $j = 1, \ldots, J$, and with independent prior distributions, $\alpha \sim t_4(0, 2^2)$ 
and $\beta \sim t_4(0, 1)$. Suppose $J = 10$, the $x_j$ values are randomly drawn from a $\text{U}(0, 1)$ 
distribution, and $n_j \sim \text{Poisson}^+(5)$, where $\text{Poisson}^+$ is the Poisson distribution restricted 
to positive values. 

(a) Sample a dataset at random from the model. 

(b) Use rejection sampling to get 1000 independent posterior draws from $(\alpha, \beta)$. 

(c) Approximate the posterior density for $(\alpha, \beta)$ by a normal centered at the posterior 
mode with covariance matrix fit to the curvature at the mode. 

(d) Take 1000 draws from the two-dimensional $t_4$ distribution with that center and scale 
matrix and use importance sampling to estimate $E(\alpha|y)$ and $E(\beta|y)$. 


4. Analytic approximation to a subset of the parameters: suppose that the joint posterior 
distribution $p(\theta_1, \theta_2|y)$ is of interest and that it is known that the $t$ provides an adequate 
approximation to the conditional distribution, $p(\theta_1|\theta_2, y)$. Show that both the normal 
and $t$ approaches described in the last paragraph of Section 13.5 lead to the same answer. 


5. Estimating the number of unseen species (see Fisher, Corbet, and Williams, 1943, Efron 
and Thisted, 1976, and Seber, 1992): suppose that during an animal trapping expedition 
the number of times an animal from species $i$ is caught is $x_i \sim \text{Poisson}(\lambda_i)$. For parts 
(a)-(d) of this problem, assume a $\text{Gamma}(\alpha, \beta)$ prior distribution for the $\lambda_i$'s, with a 
uniform hyperprior distribution on $(\alpha, \beta)$. The only observed data are $y_k$, the number of 
species observed exactly $k$ times during a trapping expedition, for $k = 1, 2, 3, \ldots$ 

(a) Write the distribution $p(x_i|\alpha, \beta)$. 
(b) Use the distribution of $x_i$ to derive a multinomial distribution for $y$ given that there 
are a total of $N$ species. 
(c) Suppose that we are given $y = (118, 74, 44, 24, 29, 22, 20, 14, 20, 15, 12, 14, 6, 12, 6, 
9, 9, 6, 10, 10, 11, 5, 3, 3)$, so that 118 species were observed only once, 74 species were 
observed twice, and so forth, with a total of 496 species observed and 3266 animals 
caught. Write down the likelihood for $y$ using the multinomial distribution with 24 
cells (ignoring unseen species). Use any method to find the mode of $(\alpha, \beta)$ and an 
approximate second derivative matrix. 

(d) Derive an estimate and approximate 95% posterior interval for the number of addi- 
tional species that would be observed if 10,000 more animals were caught. 

(e) Evaluate the fit of the model to the data using appropriate posterior predictive checks. 

(f) Discuss the sensitivity of the inference in (d) to each of the model assumptions. 


6. Derivation of the monotone convergence of EM algorithm: prove that the function 
$E_{\text{old}} \log p(\gamma|\phi, y)$ in (13.5) is maximized at $\phi = \phi^{\text{old}}$. (Hint: express the expectation 
as an integral and apply Jensen’s inequality to the concave logarithm function.) 

7. Conditional maximization for the hierarchical normal model: show that the conditional 
modes of $\sigma$ and $\tau$ associated with (11.14) and (11.16), respectively, are correct. 

8. Joint posterior modes for hierarchical models: 

(a) Show that the posterior density for the coagulation example from Table 11.2 on page 
288 has a degenerate mode at $\tau = 0$ and $\theta_j = \mu$ for all $j$. 

(b) The rest of this exercise demonstrates that the degenerate mode represents a small 
part of the posterior distribution. First estimate an upper bound on the integral 
of the unnormalized posterior density in the neighborhood of the degenerate mode. 
(Approximate the integrand so that the integral is analytically tractable.) 

(c) Now approximate the integral of the unnormalized posterior density in the neighbor- 
hood of the other mode using the density at the mode and the second derivative matrix 
of the log posterior density at the mode. 

(d) Finally, estimate an upper bound on the posterior mass in the neighborhood of the 
degenerate mode. 


9. EM algorithm: 


(a) For the hierarchical normal model in Section 13.6, derive the expressions (13.14) for 
$\mu^{\text{new}}$, $\sigma^{\text{new}}$, and $\tau^{\text{new}}$. 
(b) Pick values for the hyperparameters in this model, then simulate fake data, then apply 
EM to estimate the model. Compare the EM estimate to the assumed true model. 
10. Variational Bayes: Consider probit regression, which is just like logistic except that the 
function $\text{logit}^{-1}$ is replaced by the normal cumulative distribution function. Set up 
and program variational Bayes for a probit regression with two coefficients (that is, 
$\Pr(y_i = 1) = \Phi(a + b x_i)$, for $i = 1, \ldots, n$), using the latent-data formulation (so that 
$z_i \sim \text{N}(a + b x_i, 1)$ and $y_i = 1$ if $z_i > 0$ and 0 otherwise): 
(a) Write the log posterior density (up to an arbitrary constant), $p(a, b, z|y)$. 
(b) Assuming a variational approximation $g$ that is independent in its $n + 2$ dimensions, 
determine the functional form of each of the factors in $g$. 
(c) Write the steps of the variational Bayes algorithm and program them in R. 
11. Unknown normalizing functions: compute the normalizing factor for the following un- 
normalized sampling density, 


$$p(y|\mu, A, B, C) \propto \exp\left[-\tfrac{1}{2}\left(A(y - \mu)^6 + B(y - \mu)^4 + C(y - \mu)^2\right)\right],$$


as a function of A, B,C. (Hint: it will help to integrate out analytically as many of the 
parameters as you can.) 
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Part IV: Regression Models 


With modern computational tools at our disposal, we now turn to linear regression and 
generalized linear models, which are the statistical methods most commonly used to un- 
derstand the relations between variables. Chapter 14 reviews classical regression from a 
Bayesian context, then Chapters 15 and 16 consider hierarchical linear regression and gen- 
eralized linear models, along with the analysis of variance. Chapter 17 discusses robust 
alternatives to the standard normal, binomial, and Poisson distributions, and Chapter 18 
discusses imputation of missing data. 


Chapter 14 


Introduction to regression models 


Linear regression is one of the most widely used statistical tools. This chapter introduces 
Bayesian model building and inference for normal linear models, focusing on the simple case 
of uniform prior distributions. We apply the hierarchical modeling ideas of Chapter 5 in 
the context of linear regression in the next chapter. The analysis of the educational testing 
experiments in Chapter 5 is a special case of hierarchical linear modeling. 

The topics of setting up and checking linear regression models are far too broad to be 
adequately covered in one or two chapters here. Rather than attempt a complete treatment 
in this book, we cover the standard forms of regression in enough detail to show how to set up 
the relevant Bayesian models and draw samples from posterior distributions for parameters 
$\theta$ and future observables $\tilde{y}$. For the simplest case of linear regression, we derive the basic 
results in Section 14.2 and discuss the major applied issues in Section 14.3 with an extended 
example of estimating the effect of incumbency in elections. In the later sections of this 
chapter, we discuss analytical and computational methods for more complicated models. 

Throughout, we describe computations that build on the methods of standard least- 
squares regression where possible. In particular, we show how simple simulation methods 
can be used to (1) draw samples from posterior and predictive distributions, automatically 
incorporating uncertainty in the model parameters, and (2) draw samples for posterior 
predictive checks. 


14.1 Conditional modeling 


Many studies concern relations among two or more observables. A common question is: 
how does one quantity, y, vary as a function of another quantity or vector of quantities, x? 
In general, we are interested in the conditional distribution of $y$, given $x$, parameterized as 
$p(y|\theta, x)$, under a model in which the $n$ observations $(x, y)_i$ are exchangeable. 


Notation 


The quantity of primary interest, $y$, is called the response or outcome variable; we assume 
here that it is continuous. The variables $x = (x_1, \ldots, x_k)$ are called explanatory variables 
and may be discrete or continuous. We sometimes choose a single variable $x_j$ of primary 
interest and call it the treatment variable, labeling the other components of $x$ as control 
variables. The distribution of $y$ given $x$ is typically studied in the context of a set of 
units or experimental subjects, $i = 1, \ldots, n$, on which $y_i$ and $x_{i1}, \ldots, x_{ik}$ are measured. 
Throughout, we use $i$ to index units and $j$ to index components of $x$. We use $y$ to denote 
the vector of outcomes for the $n$ subjects and $X$ as the $n \times k$ matrix of predictors. 

The simplest and most widely used version of this model is the normal linear model, in 
which the distribution of y given X is normal with a mean that is a linear function of X: 


$$E(y_i|\beta, X) = \beta_1 x_{i1} + \cdots + \beta_k x_{ik},$$

for $i = 1, \ldots, n$. For many applications, the variable $x_{i1}$ is fixed at 1, so that $\beta_1 x_{i1}$ is 
constant for all $i$. 

In this chapter, we restrict our attention to the normal linear model; in Sections 14.2–14.5, 
we further restrict to the case of ordinary linear regression in which the conditional 
variances are equal, $\text{var}(y_i|\theta, X) = \sigma^2$ for all $i$, and the observations $y_i$ are conditionally 
independent given $\theta, X$. The parameter vector is then $\theta = (\beta_1, \ldots, \beta_k, \sigma)$. We consider 
more complicated variance structures in Section 14.7. 

In the normal linear model framework, the key statistical modeling issues are (1) defining 
the variables x and y (possibly using transformations) so that the conditional expectation 
of y is reasonably linear as a function of the columns of X with approximately normal 
errors, and (2) setting up a prior distribution on the model parameters that accurately 
reflects substantive knowledge—a prior distribution that is sufficiently strong for the model 
parameters to be accurately estimated from the data at hand, yet not so strong as to 
dominate the data inappropriately. The statistical inference problem is to estimate the 
parameters $\theta$, conditional on $X$ and $y$. 

Because we can choose as many variables X as we like and transform the X and y 
variables in any convenient way, the normal linear model is a remarkably flexible tool for 
quantifying relationships among variables. In Chapter 16, we discuss generalized linear 
models, which broaden the range of problems to which the linear predictor can be applied. 


Formal Bayesian justification of conditional modeling 


The numerical ‘data’ in a regression problem include both $X$ and $y$. Thus, a full Bayesian 
model includes a distribution for $X$, $p(X|\psi)$, indexed by a parameter vector $\psi$, and thus 
involves a joint likelihood, $p(X, y|\psi, \theta)$, along with a prior distribution, $p(\psi, \theta)$. In the 
standard regression context, the distribution of $X$ is assumed to provide no information 
about the conditional distribution of $y$ given $X$; that is, we assume prior independence of 
the parameters $\theta$ determining $p(y|X, \theta)$ and the parameters $\psi$ determining $p(X|\psi)$. 

Thus, from a Bayesian perspective, the defining characteristic of a ‘regression model’ 
is that it ignores the information supplied by $X$ about $(\psi, \theta)$. How can this be justified? 
Suppose $\psi$ and $\theta$ are independent in their prior distribution; that is, $p(\psi, \theta) = p(\psi)p(\theta)$. 
Then the posterior distribution factors, 


$$p(\psi, \theta|X, y) = p(\psi|X)\,p(\theta|X, y),$$


and we can analyze the second factor by itself (that is, as a standard regression model), 
with no loss of information: 
$$p(\theta|X, y) \propto p(\theta)\,p(y|X, \theta).$$
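
To see why the factorization holds, it can help to write out the intermediate steps. The following sketch uses only the assumptions already stated: prior independence of $\psi$ and $\theta$, a distribution for $X$ that depends only on $\psi$, and a conditional distribution for $y$ given $X$ that depends only on $\theta$.

```latex
\begin{align*}
p(\psi, \theta \mid X, y)
  &\propto p(\psi, \theta)\, p(X, y \mid \psi, \theta) \\
  &= p(\psi)\, p(\theta)\, p(X \mid \psi)\, p(y \mid X, \theta) \\
  &= \bigl[\, p(\psi)\, p(X \mid \psi) \,\bigr]\, \bigl[\, p(\theta)\, p(y \mid X, \theta) \,\bigr] \\
  &\propto p(\psi \mid X)\, p(\theta \mid X, y).
\end{align*}
```

Because the two bracketed factors share no parameters, inference for $\theta$ can proceed from the second factor alone.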


When the explanatory variables $X$ are chosen (for example, in a designed experiment), their 
probability $p(X)$ is known, and there are no parameters $\psi$. 

The practical advantage of using such a regression model is that it is much easier to spec- 
ify a realistic conditional distribution of one variable given k others than a joint distribution 
on all k+1 variables. 


14.2 Bayesian analysis of classical regression 


A large part of applied statistical analysis is based on linear regression techniques that can 
be thought of as Bayesian posterior inference based on a noninformative prior distribution 
for the parameters of the normal linear model. In Sections 14.2–14.5, we briefly outline, from 
a Bayesian perspective, the choices involved in setting up a regression model; these issues 
also apply to methods such as the analysis of variance and the analysis of covariance that 
can be considered special cases of linear regression. For more discussion of these issues from 
a non-Bayesian perspective, see any standard regression or econometrics textbook. Under 
a standard noninformative prior distribution, the Bayesian estimates and standard errors 
coincide with the classical results. However, even in the noninformative case, posterior 
simulations are useful for predictive inference and model checking. 


Notation and basic model 


In the simplest case, sometimes called ordinary linear regression, the observation errors are 
independent and have equal variance; in vector notation, 


$$y|\beta, \sigma, X \sim \text{N}(X\beta, \sigma^2 I), \qquad (14.1)$$


where $I$ is the $n \times n$ identity matrix. We discuss departures from the assumptions of 
the ordinary linear regression model—notably, the constant variance and zero conditional 
correlations in (14.1)—in Section 14.7. 


The standard noninformative prior distribution 


In the normal regression model, a convenient noninformative prior distribution is uniform 
on $(\beta, \log \sigma)$ or, equivalently, 
$$p(\beta, \sigma^2|X) \propto \sigma^{-2}. \qquad (14.2)$$


When there are many data points and only a few parameters, the noninformative prior 
distribution is useful—it gives acceptable results (for reasons discussed in Chapter 4) and 
takes less effort than specifying prior knowledge in probabilistic form. For a small sample 
size or a large number of parameters, the likelihood is less sharply peaked, and so prior 
distributions and hierarchical models are more important. We return to the issue of prior 
information for the normal linear model in Section 14.8 and Chapter 15. 


The posterior distribution 


As with the normal distribution with unknown mean and variance analyzed in Chapter 3, 
we determine first the posterior distribution for $\beta$, conditional on $\sigma$, and then the marginal 
posterior distribution for $\sigma^2$. That is, we factor the joint posterior distribution for $\beta$ and $\sigma^2$ 
as $p(\beta, \sigma^2|y) = p(\beta|\sigma^2, y)\,p(\sigma^2|y)$. For notational convenience, we suppress the dependence 
on $X$ here and in subsequent notation. 

Conditional posterior distribution of $\beta$, given $\sigma$. The conditional posterior distribution of 
the (vector) parameter $\beta$, given $\sigma$, is the exponential of a quadratic form in $\beta$ and hence is 
normal. We use the notation 


$$\beta|\sigma, y \sim \text{N}(\hat{\beta}, V_\beta \sigma^2), \qquad (14.3)$$

where, using the now familiar technique of completing the square (see Exercise 14.3), 

$$\hat{\beta} = (X^T X)^{-1} X^T y, \qquad (14.4)$$
$$V_\beta = (X^T X)^{-1}. \qquad (14.5)$$


Marginal posterior distribution of $\sigma^2$. The marginal posterior distribution of $\sigma^2$ can be 
written as 

$$p(\sigma^2|y) = \frac{p(\beta, \sigma^2|y)}{p(\beta|\sigma^2, y)},$$

which can be seen to have a scaled inverse-$\chi^2$ form (see Exercise 14.4), 

$$\sigma^2|y \sim \text{Inv-}\chi^2(n - k, s^2), \qquad (14.6)$$

where 

$$s^2 = \frac{1}{n - k}(y - X\hat{\beta})^T (y - X\hat{\beta}). \qquad (14.7)$$


The marginal posterior distribution of $\beta|y$, averaging over $\sigma$, is multivariate $t$ with $n - k$ 
degrees of freedom, but we rarely use this fact in practice when drawing inferences by 
simulation, since to characterize the joint posterior distribution we can draw simulations of 
$\sigma$ and then $\beta|\sigma$. 


Comparison to classical regression estimates. The standard non-Bayesian estimates of $\beta$ 
and $\sigma$ are $\hat{\beta}$ and $s$, respectively, as just defined. The classical standard error estimate for $\beta$ 
is obtained by setting $\sigma = s$ in (14.3). 
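
As an illustration of this correspondence, the following sketch computes $\hat{\beta}$, $s$, and the classical standard errors $s\sqrt{\text{diag}(V_\beta)}$ with NumPy. The simulated data, the seed, and all variable names are assumptions made for this example and are not part of the original analysis.

```python
import numpy as np

# Synthetic data for illustration only (assumed values, not from the text).
rng = np.random.default_rng(3)
n, k = 30, 2
X = np.column_stack([np.ones(n), np.linspace(0, 1, n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

V_beta = np.linalg.inv(X.T @ X)                       # (14.5)
beta_hat = V_beta @ X.T @ y                           # (14.4)
s = np.sqrt(np.sum((y - X @ beta_hat) ** 2) / (n - k))  # square root of (14.7)

# Classical standard errors: set sigma = s in the covariance V_beta * sigma^2.
std_err = s * np.sqrt(np.diag(V_beta))

print("estimates:      ", beta_hat)
print("standard errors:", std_err)
```

The explicit inverse is used here only to mirror the formulas; for larger problems a least-squares solver or the QR approach is usually preferable.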


Checking that the posterior distribution is proper. As for any analysis based on an improper 
prior distribution, it is important to check that the posterior distribution is proper (that is, 
has a finite integral). It turns out that $p(\beta, \sigma^2|y)$ is proper as long as (1) $n > k$ and (2) the 
rank of X equals k (see Exercise 14.6). Statistically, in the absence of prior information, 
the first condition requires that there be at least as many data points as parameters, and 
the second condition requires that the columns of X be linearly independent (that is, no 
column can be expressed as a linear combination of the other columns) in order for all k 
coefficients of $\beta$ to be uniquely identified by the data. 
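
The two conditions are easy to check numerically before fitting. The helper below is a minimal sketch; the function name `posterior_is_proper` and the small example matrices are hypothetical, and the rank is computed with NumPy's default numerical tolerance.

```python
import numpy as np

def posterior_is_proper(X):
    """Check the two conditions n > k and rank(X) == k for an n x k matrix X."""
    n, k = X.shape
    return bool(n > k and np.linalg.matrix_rank(X) == k)

x = np.arange(6.0)
X_ok = np.column_stack([np.ones(6), x])                 # n = 6, k = 2, full column rank
X_collinear = np.column_stack([np.ones(6), x, 2 * x])   # third column is a multiple of the second
X_short = np.eye(2)                                     # n = k = 2, so n > k fails

print(posterior_is_proper(X_ok))
print(posterior_is_proper(X_collinear))
print(posterior_is_proper(X_short))
```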


Sampling from the posterior distribution 


It is easy to draw samples from the posterior distribution, $p(\beta, \sigma^2|y)$, by (1) computing $\hat{\beta}$ 
from (14.4) and $V_\beta$ from (14.5), (2) computing $s^2$ from (14.7), (3) drawing $\sigma^2$ from the scaled 
inverse-$\chi^2$ distribution (14.6), and (4) drawing $\beta$ from the multivariate normal distribution 
(14.3). In practice, $\hat{\beta}$ and $V_\beta$ can be computed using standard linear regression software. 
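
The four steps can be sketched directly in NumPy as follows. This is an illustrative implementation under the noninformative prior (14.2), not the authors' code; the function name `sample_posterior`, the simulated data, and the seed are assumptions. A scaled inverse-$\chi^2(\nu, s^2)$ draw is generated as $\nu s^2 / \chi^2_\nu$.

```python
import numpy as np

def sample_posterior(X, y, n_draws, rng):
    """Draw (beta, sigma^2) from p(beta, sigma^2 | y) under the prior (14.2)."""
    n, k = X.shape
    V_beta = np.linalg.inv(X.T @ X)                 # step 1: (14.5)
    beta_hat = V_beta @ (X.T @ y)                   # step 1: (14.4)
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - k)                    # step 2: (14.7)
    # Step 3: sigma^2 | y ~ Inv-chi^2(n - k, s^2), drawn as (n - k) s^2 / chi^2_{n-k}.
    sigma2 = (n - k) * s2 / rng.chisquare(n - k, size=n_draws)
    # Step 4: beta | sigma, y ~ N(beta_hat, V_beta sigma^2), using L L^T = V_beta.
    L = np.linalg.cholesky(V_beta)
    z = rng.standard_normal((n_draws, k))
    beta = beta_hat + np.sqrt(sigma2)[:, None] * (z @ L.T)
    return beta, sigma2, beta_hat, s2

# Synthetic example (assumed values, for illustration only).
rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.7, size=n)
beta_draws, sigma2_draws, beta_hat, s2 = sample_posterior(X, y, 4000, rng)
```

Each row of `beta_draws` pairs with the corresponding entry of `sigma2_draws`, so the output is a set of joint posterior draws.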


To be computationally efficient, the simulation can be set up as follows, using standard 
matrix computations. (See the bibliographic note at the end of the chapter for references on 
matrix factorization and least squares computation.) Computational efficiency is important 
for large datasets and also with the iterative methods required to estimate several variance 
parameters simultaneously, as described in Section 14.7. 


1. Compute the QR factorization, $X = QR$, where $Q$ is an $n \times k$ matrix of orthonormal 
columns and $R$ is a $k \times k$ upper triangular matrix. 


2. Compute $R^{-1}$; this is an easy task since $R$ is upper triangular. $R^{-1}$ is a Cholesky 
factor (that is, a matrix square root) of the covariance matrix $V_\beta$, since 
$R^{-1}(R^{-1})^T = (X^T X)^{-1} = V_\beta$. 


3. Compute $\hat{\beta}$ by solving the linear system, $R\hat{\beta} = Q^T y$, using the fact that $R$ is upper 
triangular. 


Once $\sigma^2$ is simulated (using the random $\chi^2$ draw), $\beta$ can be easily simulated from the 
appropriate multivariate normal distribution using the Cholesky factorization and a program 
for generating independent standard normals (see Appendix A). The QR factorization of 
$X$ is useful both for computing the mean of the posterior distribution and for simulating 
the random component in the posterior distribution of $\beta$. 
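
The QR-based steps above can be sketched as follows. The function name `sample_posterior_qr` and the simulated data are assumptions for illustration. For brevity the sketch calls NumPy's general-purpose `inv` and `solve`; a dedicated triangular solver would exploit the structure of $R$ more fully.

```python
import numpy as np

def sample_posterior_qr(X, y, n_draws, rng):
    """Posterior draws of (beta, sigma^2) using the QR factorization of X."""
    n, k = X.shape
    Q, R = np.linalg.qr(X)                    # reduced QR: Q is n x k, R is k x k upper triangular
    R_inv = np.linalg.inv(R)                  # R_inv @ R_inv.T equals (X^T X)^{-1} = V_beta
    beta_hat = np.linalg.solve(R, Q.T @ y)    # solves R beta_hat = Q^T y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - k)
    sigma2 = (n - k) * s2 / rng.chisquare(n - k, size=n_draws)
    z = rng.standard_normal((n_draws, k))     # independent standard normals
    beta = beta_hat + np.sqrt(sigma2)[:, None] * (z @ R_inv.T)
    return beta, sigma2, beta_hat, R_inv

# Synthetic example (assumed values, for illustration only).
rng = np.random.default_rng(2)
n, k = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([0.5, 1.0, -1.0, 0.25]) + rng.normal(size=n)
beta_draws, sigma2_draws, beta_hat, R_inv = sample_posterior_qr(X, y, 2000, rng)
```

The same factor `R_inv` serves both purposes mentioned in the text: it enters the solution for the posterior mean and it scales the standard normal draws.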


For some large problems involving thousands of data points and hundreds of explanatory 
variables, even the QR decomposition can require substantial computer storage space and 
time, and methods such as conjugate gradient, stepwise ascent, and iterative simulation can 
be more effective. 


The posterior predictive distribution for new data 


Now suppose we apply the regression model to a new set of data, for which we have observed 
the matrix $\tilde{X}$ of explanatory variables, and we wish to predict the outcomes, $\tilde{y}$. If $\beta$ and 
$\sigma^2$ were known exactly, the vector $\tilde{y}$ would have a normal distribution with mean $\tilde{X}\beta$ and 
variance matrix $\sigma^2 I$. Instead, our current knowledge of $\beta$ and $\sigma$ is summarized by our 
posterior distribution. 


Posterior predictive simulation. The posterior predictive distribution of unobserved data, 
$p(\tilde{y}|y)$, has two components of uncertainty: (1) the fundamental variability of the model, 
represented by the variance $\sigma^2$ in $y$ not accounted for by $X\beta$, and (2) the posterior 
uncertainty in $\beta$ and $\sigma$ due to the finite sample size of $y$. (Our notation continues to suppress 
the dependence on $X$ and $\tilde{X}$.) As the sample size $n \to \infty$, the variance due to posterior 
uncertainty in $(\beta, \sigma^2)$ decreases to zero, but the predictive uncertainty remains. To draw a 
random sample $\tilde{y}$ from its posterior predictive distribution, we first draw $(\beta, \sigma)$ from their 
joint posterior distribution, then draw $\tilde{y} \sim \text{N}(\tilde{X}\beta, \sigma^2 I)$. 
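
A minimal simulation sketch of this two-stage procedure is shown below. The data, the new predictor matrix `X_new` (standing in for $\tilde{X}$), and the seed are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data and new predictor values (assumed, for illustration only).
rng = np.random.default_rng(1)
n, k, n_draws = 40, 2, 5000
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(scale=0.3, size=n)
X_new = np.column_stack([np.ones(3), np.array([0.1, 0.5, 0.9])])

# Posterior quantities under the noninformative prior.
V_beta = np.linalg.inv(X.T @ X)
beta_hat = V_beta @ X.T @ y
s2 = np.sum((y - X @ beta_hat) ** 2) / (n - k)

# Stage 1: draw (beta, sigma^2) from the joint posterior distribution.
sigma2 = (n - k) * s2 / rng.chisquare(n - k, size=n_draws)
z = rng.standard_normal((n_draws, k))
beta = beta_hat + np.sqrt(sigma2)[:, None] * (z @ np.linalg.cholesky(V_beta).T)

# Stage 2: for each posterior draw, draw y_new ~ N(X_new beta, sigma^2 I).
eps = rng.standard_normal((n_draws, X_new.shape[0]))
y_new = beta @ X_new.T + np.sqrt(sigma2)[:, None] * eps

# The spread of y_new reflects both sources of uncertainty; the spread of
# the linear predictor alone reflects only posterior uncertainty in beta.
print(y_new.std(axis=0), (beta @ X_new.T).std(axis=0))
```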


Analytic form of the posterior predictive distribution. The normal linear model is simple 
enough that we can also determine the posterior predictive distribution analytically. 
Deriving the analytic form is not necessary, since we can easily draw $(\beta, \sigma)$ and then $\tilde{y}$, as 
described above; however, we can gain useful insight by studying the predictive uncertainty 
analytically. 

We first consider the conditional posterior predictive distribution, $p(\tilde{y}|\sigma, y)$, then average 
over the posterior uncertainty in $\sigma|y$. Given $\sigma$, the future observation $\tilde{y}$ has a normal 
distribution (see Exercise 14.7), and we derive its mean by averaging over $\beta$ using (2.7): 


$$\begin{aligned}
E(\tilde{y}|\sigma, y) &= E(E(\tilde{y}|\beta, \sigma, y)|\sigma, y) \\
&= E(\tilde{X}\beta|\sigma, y) \\
&= \tilde{X}\hat{\beta},
\end{aligned}$$


where the inner expectation averages over $\tilde{y}$, conditional on $\beta$, and the outer expectation 
averages over $\beta$. All expressions are conditional on $\sigma$ and $y$, and the conditioning on $X$ and 
$\tilde{X}$ is implicit. Similarly, we can derive $\text{var}(\tilde{y}|\sigma, y)$ using (2.8): 


$$\begin{aligned}
\text{var}(\tilde{y}|\sigma, y) &= E[\text{var}(\tilde{y}|\beta, \sigma, y)|\sigma, y] + \text{var}[E(\tilde{y}|\beta, \sigma, y)|\sigma, y] \qquad (14.8) \\
&= E[\sigma^2 I|\sigma, y] + \text{var}[\tilde{X}\beta|\sigma, y] \\
&= (I + \tilde{X} V_\beta \tilde{X}^T)\sigma^2. \qquad (14.9)
\end{aligned}$$


This result makes sense: conditional on $\sigma$, the posterior predictive variance has two terms: 
$\sigma^2 I$, representing sampling variation, and $\tilde{X} V_\beta \tilde{X}^T \sigma^2$, due to uncertainty about $\beta$. 

Given $\sigma$, the future observations have a normal distribution with mean $\tilde{X}\hat{\beta}$, which 
does not depend on $\sigma$, and variance (14.9) that is proportional to $\sigma^2$. To complete the 
determination of the posterior predictive distribution, we must average over the marginal 
posterior distribution of $\sigma^2$ in (14.6). The resulting posterior predictive distribution, $p(\tilde{y}|y)$, 
is multivariate $t$ with center $\tilde{X}\hat{\beta}$, squared scale matrix $s^2(I + \tilde{X} V_\beta \tilde{X}^T)$, and $n - k$ degrees 
of freedom. 
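
The parameters of this multivariate $t$ distribution can be computed directly. The sketch below does so for a small, deterministic example; the function name `predictive_t` and the data values are assumptions for illustration. It also shows that the predictive scale grows as the new predictor values move away from the observed data.

```python
import numpy as np

def predictive_t(X, y, X_new):
    """Center, squared scale matrix, and degrees of freedom of p(y_new | y)."""
    n, k = X.shape
    V_beta = np.linalg.inv(X.T @ X)
    beta_hat = V_beta @ X.T @ y
    s2 = np.sum((y - X @ beta_hat) ** 2) / (n - k)
    center = X_new @ beta_hat
    scale = s2 * (np.eye(X_new.shape[0]) + X_new @ V_beta @ X_new.T)
    return center, scale, n - k, s2

# Small deterministic example (assumed values, for illustration only).
x = np.arange(10.0)
X = np.column_stack([np.ones(10), x])
y = 1.0 + 2.0 * x + 0.1 * (-1.0) ** np.arange(10)
# One new point at the mean of x, one far outside the observed range.
X_new = np.array([[1.0, 4.5], [1.0, 20.0]])

center, scale, df, s2 = predictive_t(X, y, X_new)
print(center, np.diag(scale), df)
```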


Prediction when $\tilde{X}$ is not completely observed. It is harder to predict $\tilde{y}$ if not all the 
explanatory variables in $\tilde{X}$ are known, because then the explanatory variables must themselves 
be modeled by a probability distribution. We return to the problem of multivariate missing 
data in Chapter 18. 


Model checking and robustness 


Checking the fit and robustness of a linear regression model is a well-developed topic in 
statistics. The standard methods such as examining plots of residuals against explanatory 
variables are useful and can be directly interpreted as posterior predictive checks. An 
advantage of the Bayesian approach is that we can compute, using simulation, the posterior 
predictive distribution for any data summary, so we do not need to put a lot of effort into 
estimating the sampling distributions of test statistics. For example, to assess the statistical 
and practical significance of patterns in a residual plot, we can obtain the posterior predictive 
distribution of an appropriate test statistic (for example, the correlation between the squared 
residuals and the fitted values), as we illustrate in Table 14.2 in the following example. 


14.3 Regression for causal inference: incumbency and voting 


We illustrate the Bayesian interpretation of linear regression with an example of construct- 
ing a regression model using substantive knowledge, computing its posterior distribution, 
interpreting the results, and checking the fit of the model to data. 

Observers of legislative elections in the United States have often noted that incumbency— 
that is, being the current representative in a district—is an advantage for candidates. Po- 
litical scientists are interested in the magnitude of the effect, formulated, for example, as 
‘what proportion of the vote is incumbency worth?’ and ‘how has incumbency advantage 
changed over the past few decades?’ We shall use linear regression to study the advantage 
of incumbency in elections for the U.S. House of Representatives in the past century. In 
order to assess changes over time, we run a separate regression for each election year in our 
study. The results of each regression can be thought of as summary statistics for the effect 
of incumbency in each election year; these summary statistics can themselves be analyzed, 
formally by a hierarchical time series model or, as we do here, informally by examining 
graphs of the estimated effect and standard errors over time. 

Every two years, the members of the U.S. House of Representatives are elected by 
plurality vote in 435 single-member districts. Typically, about 100 to 150 of the district 
elections are uncontested; that is, one candidate runs unopposed. Almost all the other 
district elections are contested by one candidate from each of the two major parties, the 
Democrats and the Republicans. In each district, one of the parties—the incumbent party 
in that district—currently holds the seat in the House, and the current officeholder—the 
incumbent—may or may not be a candidate for reelection. We are interested in the effect of 
the decision of the incumbent to run for reelection on the vote received by the incumbent 
party’s candidate. 


Units of analysis, outcome, and treatment variables 


For each election year, the units of analysis in our study are the contested district elections. 
The outcome variable, $y_i$, is the proportion of the vote received by the incumbent party (see 
below) in district $i$, and we code the treatment variable as $R_i$, the decision of the incumbent 
to run for reelection: 


$$R_i = \begin{cases} 1 & \text{if the incumbent officeholder runs for reelection} \\ 0 & \text{otherwise.} \end{cases}$$


If the incumbent does not run for reelection (that is, if $R_i = 0$), the district election is called 
an ‘open seat.’ Thus, an incumbency advantage would cause the value of $y_i$ to increase if 
$R_i = 1$. We exclude from our analysis votes for third-party candidates and elections in 
which only one major-party candidate is running; see the bibliographic note for references 
discussing these and other data-preparation issues. 

Figure 14.1 U.S. congressional elections: Democratic proportion of the vote in contested districts in 
1986 and 1988. Dots and circles indicate districts that in 1988 had incumbents running and open 
seats, respectively. Points on the left and right halves of the graph correspond to the incumbent 
party being Republican or Democratic. 


We analyze the data as an observational study in which we are interested in estimating 
the effect of the incumbency variable on the vote proportion. The estimand of primary 
interest in this study is the average effect of incumbency. 

We define the theoretical incumbency advantage for an election in a single district $i$ as 


$$\text{incumbency advantage}_i = y^I_{\text{complete}\,i} - y^O_{\text{complete}\,i}, \qquad (14.10)$$


where 


$y^I_{\text{complete}\,i}$ = proportion of the vote in district $i$ received by the incumbent legislator, 
if he or she runs for reelection against major-party opposition in district $i$ (thus, 
$y^I_{\text{complete}\,i}$ is unobserved in an open-seat election), 

$y^O_{\text{complete}\,i}$ = proportion of the vote in district $i$ received by the incumbent party, if 
the incumbent legislator does not run and the two major parties compete for the open 
seat (thus, $y^O_{\text{complete}\,i}$ is unobserved if the incumbent runs for reelection). 


The observed outcome, $y_i$, equals either $y^I_{\text{complete}\,i}$ or $y^O_{\text{complete}\,i}$, depending on whether the 
treatment variable $R_i$ equals 1 or 0. 

We define the aggregate incumbency advantage for an entire legislature as the average of 
the incumbency advantages for all districts in a general election. This theoretical definition 
applies within a single election year and allows incumbency advantage to vary among dis- 
tricts. The definition (14.10) does not assume that the candidates under the two treatments 
are identical in all respects except for incumbency status. 
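
The definitions can be made concrete with a toy example. All numbers below are hypothetical values invented for illustration; in a real election only one of the two potential outcomes is observed in each district, so the district-level and aggregate advantages cannot be computed this way.

```python
import numpy as np

# Hypothetical potential outcomes for four districts (invented values).
y_I = np.array([0.62, 0.55, 0.71, 0.58])  # vote share if the incumbent runs
y_O = np.array([0.54, 0.50, 0.66, 0.49])  # vote share if the seat is open
R = np.array([1, 0, 1, 1])                # treatment: 1 if the incumbent runs

# Only one potential outcome per district is actually observed.
y_obs = np.where(R == 1, y_I, y_O)

# District-level advantage (14.10) and its average over districts.
advantage = y_I - y_O
aggregate_advantage = advantage.mean()

print(y_obs)
print(aggregate_advantage)
```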

The incumbency advantage in a district depends on both $y^I_{\text{complete}\,i}$ and $y^O_{\text{complete}\,i}$; 
unfortunately, a real election in a single district will reveal only one of these. The problem 
can be thought of as causal inference from an observational study; as discussed in Sections 
8.4 and 8.6, the average treatment effect can be estimated using regression, if we condition 
on enough control variables for the treatment assignment to be considered ignorable. 


Setting up control variables so that data collection is approximately ignorable 


It would be possible to estimate the incumbency advantage with only two columns in X: 
the treatment variable and the constant term (a column of ones). The regression would 
then be directly comparing the vote shares of incumbents to nonincumbents. The problem 
is that, since incumbency is not a randomly assigned experimental treatment, incumbents 
and nonincumbents no doubt differ in important ways other than incumbency. For example, 
suppose that incumbents tend to run for reelection in ‘safe seats’ that favor their party, but 
typically decline to run for reelection in ‘marginal seats’ that they have less chance of 
winning. If this were the case, then incumbents would be getting higher vote shares than 
nonincumbents, even in the absence of incumbency advantage. The resulting inference for 
incumbency advantage would be flawed because of serious nonignorability in the treatment 
assignment. 

A partial solution is to include the vote for the incumbent party in the previous election 
as a control variable. Figure 14.1 shows the data for the 1988 election, using the 1986 
election as a control variable. Each symbol in Figure 14.1 represents a district election; 
the dots represent districts in which an incumbent is running for reelection, and the open 
circles represent ‘open seats’ in which no incumbent is running. The vertical coordinate of 
each point is the share of the vote received by the Democratic party in the district, and 
the horizontal coordinate is the share of the vote received by the Democrats in the previous 
election in that district. The strong correlation confirms both the importance of using the 
previous election outcome as a control variable and the rough linear relation between the 
explanatory and outcome variables. 

We include another control variable for the incumbent party: P_i = 1 if the Democrats 
control the seat and −1 if the Republicans control the seat before the election, whether or 
not the incumbent is running for reelection. This includes in the model a possible nationwide 
partisan swing; for example, a swing of 5% toward the Democrats would add 5% to y_i for 
districts i in which the Democrats are the incumbent party and −5% to y_i for districts i in 
which the Republicans are the incumbent party. 

It might make sense to include other control variables that may affect the treatment 
and outcome variables, such as incumbency status in that district in the previous election, 
the outcome in the district two elections earlier, and so forth. At some point, additional 
variables will add little to the ability of the regression model to predict y and will have 
essentially no influence on the coefficient for the treatment variable. 
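
To make the role of the control variables concrete, the following sketch simulates hypothetical district data (not the congressional data analyzed in this section) in which incumbents are more likely to run in safe seats, and compares the two-column regression with one that also controls for the previous vote. The selection mechanism, the sample size, and the assumed true incumbency effect of 0.10 are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Hypothetical data: v is the incumbent party's previous vote share.
v = rng.uniform(0.5, 0.9, size=n)
# Assumed selection mechanism: incumbents run more often in safe seats.
p_run = np.clip((v - 0.5) / 0.4, 0.05, 0.95)
R = rng.binomial(1, p_run)
# Assumed outcome model with a true incumbency effect of 0.10.
y = 0.2 + 0.6 * v + 0.10 * R + rng.normal(0.0, 0.07, size=n)

def least_squares(X, y):
    """Return the least-squares coefficient vector for y regressed on X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

ones = np.ones(n)
# Two columns only: constant term and treatment indicator.
naive = least_squares(np.column_stack([ones, R]), y)[1]
# Add the previous vote as a control variable.
controlled = least_squares(np.column_stack([ones, R, v]), y)[1]

print(f"coefficient of R without control: {naive:.3f}")
print(f"coefficient of R with control:    {controlled:.3f}")
```

In this simulation the uncontrolled comparison attributes part of the safe-seat effect to incumbency, whereas conditioning on the previous vote brings the coefficient back near the value used to generate the data.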


Implicit ignorability assumption 


For our regression to estimate the actual effect of incumbency, we are implicitly assuming 
that the treatment assignments (the decision of an incumbent political party to run an 
incumbent or not in district i, and thus to set R_i equal to 0 or 1), conditional on the control 
variables, do not depend on any other variables that also affect the election outcome. For 
example, if incumbents who knew they would lose decided not to run for reelection, then the 
decision to run would depend on an unobserved outcome, the treatment assignment would 
be nonignorable, and the selection effect would have to be modeled. 

In a separate analysis of these data, we have found that the probability an incumbent 
runs for reelection is approximately independent of the vote in that district in the previous 
election. If electoral vulnerability were a large factor in the decision to run, we would expect 
that incumbents with low victory margins in the previous election would be less likely to 
run for reelection. Since this does not occur, we believe the departures from ignorability 
are small. So, although the ignorability assumption is imperfect, we tentatively accept 
it for this analysis. The decision to accept ignorability is made based on subject-matter 
knowledge and additional data analysis. 


Transformations 


Since the explanatory variable is restricted to lie between 0 and 1 (recall that we have 
excluded uncontested elections from our analysis), it would seem advisable to transform the 
data, perhaps using the logit transformation, before fitting a linear regression model. In 
practice, however, almost all the vote proportions y_i fall between 0.2 and 0.8, so the effect 
of such a transformation would be minor. We analyze the data on the original scale for 
simplicity in computation and interpretation of inferences. 


Figure 14.2 Incumbency advantage over time: posterior median and 95% interval for each election 
year. The inference for each year is based on a separate regression. As an example, the results 
from the regression for 1988, based on the data in Figure 14.1, are displayed in Table 14.1. 


Posterior inference 


As an initial analysis, we estimate separate regressions for each of the election years in the 
twentieth century, excluding election years immediately following redrawing of the district 
boundaries, for it is difficult to define incumbency in those years. Posterior means and 
95% posterior intervals (determined analytically from the appropriate t distributions) for 
the coefficient for incumbency are displayed for each election year in Figure 14.2. As usual, 
we can use posterior simulations of the regression coefficients to compute any quantity of 
interest. For example, the increase from the average incumbency advantage in the 1950s 
to the average advantage in the 1980s has a posterior mean of 0.050 with a central 95% 
posterior interval of [0.035, 0.065], according to an estimate based on 1000 independent 
simulation draws. 

These results are based on using incumbent party and previous election result as control 
variables (in addition to the constant term). Including more control variables to account 
for earlier incumbency and election results did not substantially change the inference about 
the coefficient of the treatment variable, and in addition made the analysis more difficult 
because of complications such as previous elections that were uncontested. 

As an example of the results from a single regression, Table 14.1 displays posterior 
inferences for the coefficients β and residual standard deviation σ of the regression estimating 
the incumbency advantage in 1988, based on a noninformative uniform prior distribution 
on (β, log σ) and the data displayed in Figure 14.1. The posterior quantiles could have been 
computed by simulation, but for this simple case we computed them analytically from the 
posterior t and scaled inverse-χ² distributions. 
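
The simulation route mentioned above can be sketched as follows. Under the noninformative prior, σ² given y has a scaled inverse-χ² distribution with n − k degrees of freedom and scale s², and β given σ² and y is normal with mean equal to the least-squares estimate and covariance matrix (XᵀX)⁻¹σ². The function name simulate_posterior and the synthetic data are our own illustrative choices; the coefficient values are hypothetical and only loosely resemble the magnitudes in Table 14.1.

```python
import numpy as np

def simulate_posterior(X, y, n_sims=1000, rng=None):
    """Draw (beta, sigma) under the prior p(beta, log sigma) proportional to 1."""
    rng = np.random.default_rng() if rng is None else rng
    n, k = X.shape
    V_beta = np.linalg.inv(X.T @ X)
    beta_hat = V_beta @ X.T @ y
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - k)
    # sigma^2 | y ~ scaled inverse-chi^2(n - k, s^2)
    sigma2 = (n - k) * s2 / rng.chisquare(n - k, size=n_sims)
    # beta | sigma^2, y ~ N(beta_hat, V_beta * sigma^2)
    L = np.linalg.cholesky(V_beta)
    z = rng.standard_normal((n_sims, k))
    beta = beta_hat + np.sqrt(sigma2)[:, None] * (z @ L.T)
    return beta, np.sqrt(sigma2)

# Hypothetical data: constant term, incumbency indicator, previous vote share.
rng = np.random.default_rng(1)
n = 300
X = np.column_stack([np.ones(n),
                     rng.binomial(1, 0.8, n),
                     rng.uniform(0.3, 0.9, n)])
y = X @ np.array([0.13, 0.11, 0.65]) + rng.normal(0, 0.066, n)

beta, sigma = simulate_posterior(X, y, n_sims=1000, rng=rng)
# Posterior quantiles for the coefficient of the incumbency indicator.
quantiles = np.percentile(beta[:, 1], [2.5, 25, 50, 75, 97.5])
print(np.round(quantiles, 3))
```

Any function of the coefficients, such as a difference between averages of coefficients from separate regressions, can be summarized in the same way by applying it to each simulation draw.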


Model checking and sensitivity analysis 


The estimates in Figure 14.2 are plausible and also add to our understanding of elections, 
in giving an estimate of the magnitude of the incumbency advantage (‘what proportion of 
the vote is incumbency worth?’) and evidence of a small positive incumbency advantage in 
the first half of the century. 


Variable                    Posterior quantiles 
                            2.5%     25%      median   75%      97.5% 
Incumbency                  0.084    0.103    0.114    0.124    0.144 
Vote proportion in 1986     0.576    0.627    0.654    0.680    0.731 
Incumbent party            −0.014   −0.009   −0.007   −0.004    0.001 
Constant term               0.066    0.106    0.127    0.148    0.188 
σ (residual sd)             0.061    0.064    0.066    0.068    0.071 


Table 14.1 Inferences for parameters in the regression estimating the incumbency advantage in 
1988. The outcome variable is the incumbent party’s share of the two-party vote in 1988, and only 
districts that were contested by both parties in both 1986 and 1988 were included. The parameter of 
interest is the coefficient of incumbency. Data are displayed in Figure 14.1. The posterior median 
and 95% interval for the coefficient of incumbency correspond to the bar for 1988 in Figure 14.2. 


Figure 14.3 Standardized residuals, (y_it − X_it β)/s_t, from the incumbency advantage regressions for 
the 1980s, vs. Democratic vote in the previous election. (The subscript t indexes the election years.) 
Dots and circles indicate district elections with incumbents running and open seats, respectively. 
(Vertical axis: standardized residual; horizontal axis: Democratic vote in previous election.) 


In addition it is instructive, and crucial if we are to have any faith in our results, to 
check the fit of the model to our data. 


Search for outliers. A careful look at Figure 14.1 suggests that the outcome variable is not 
normally distributed, even after controlling for its linear regression on the treatment and 
control variables. To examine the outliers further, we plot in Figure 14.3 the standardized 
residuals from the regressions from the 1980s. As in Figure 14.1, elections with incumbents 
and open seats are indicated by dots and circles, respectively. (We show data from just one 
decade because displaying the points from all the elections in our data would overwhelm the 
scatterplot.) For the standardized residual for the data point i, we just use (y_i − X_i β)/s, 
where s is the estimated standard deviation from equation (14.7). For simplicity, we still 
have a separate regression, and thus separate values of β and s, for each election year. 
If the normal linear model is correct, the standardized residuals should be approximately 
normally distributed, with a mean of 0 and standard deviation of about 1. Some of the 
standardized residuals in Figure 14.3 appear to be outliers by comparison to the normal 
distribution. (The residual standard deviations of the regressions are about 0.07—see Table 
14.1, for example—and almost all of the vote shares lie between 0.2 and 0.8, so the fact 
that the vote shares are bounded between 0 and 1 is essentially irrelevant here.) 


                     Observed               Posterior predictive dist. 
                     proportion             of proportion of outliers 
                     of outliers            2.5%      median    97.5% 

Open seats           41/1596 = 0.0257       0.0013    0.0038    0.0069 
Incumbent running    84/10303 = 0.0082      0.0028    0.0041    0.0054 


Table 14.2 Summary of district elections that are ‘outliers’ (defined as having absolute (unstandard- 
ized) residuals from the regression model of more than 0.2) for the incumbency advantage example. 
Elections are classified as open seats or incumbent running; for each category, the observed propor- 
tion of outliers is compared to the posterior predictive distribution. Both observed proportions are 
far higher than expected under the model. 


Posterior predictive checks. To perform a more formal check, we compute the proportion 
of district elections over a period of decades whose unstandardized residuals from the fitted 
regression models are greater than 0.20 in absolute value, a value that is roughly 3 estimated 
standard deviations away from zero. We use this unconventional measure partly to demon- 
strate the flexibility of the Bayesian approach to posterior predictive model checking and 
partly because the definition has an easily understood political meaning—the proportion 
of elections mispredicted by more than 20%. The results classified by incumbency status 
appear in the first column of Table 14.2. 

As a comparison, we simulate the posterior predictive distribution of the test statistics: 


1. For l = 1, ..., 1000: 

   (a) For each election year in the study: 

       i. Draw (β, σ) from their posterior distribution. 

       ii. Draw a hypothetical replication, y^rep, from the predictive distribution, 
           y^rep ~ N(Xβ, σ²I), given the drawn values of (β, σ) and the existing matrix X 
           for that election year. 

       iii. Run a regression of y^rep on X and save the residuals. 

   (b) Combine the results from the individual election years to get the proportion of residuals 
       that exceed 0.2 in absolute value, for elections with and without incumbents running. 


2. Use the 1000 simulated values of the above test statistics to represent the posterior 
   predictive distribution. 


Quantiles from the posterior predictive distributions of the test statistics are shown as the 
final three columns of Table 14.2. The observed proportions of outliers in the two categories 
are about seven times and twice the median values expected under the model and can clearly not be 
explained by chance. 
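
The simulation steps above can be sketched for a single regression as follows; the analysis in the text repeats the inner steps for each election year and pools the residuals. The function names and the data are hypothetical: we generate vote shares from a normal linear model and then subtract 0.3 from 1% of the districts, an invented stand-in for unmodeled shocks such as scandals.

```python
import numpy as np

def draw_posterior(X, y, rng):
    """One draw of (beta, sigma) under the prior p(beta, log sigma) proportional to 1."""
    n, k = X.shape
    V_beta = np.linalg.inv(X.T @ X)
    beta_hat = V_beta @ X.T @ y
    s2 = np.sum((y - X @ beta_hat) ** 2) / (n - k)
    sigma2 = (n - k) * s2 / rng.chisquare(n - k)
    L = np.linalg.cholesky(V_beta)
    beta = beta_hat + np.sqrt(sigma2) * (L @ rng.standard_normal(k))
    return beta, np.sqrt(sigma2)

def outlier_proportion(X, y, cutoff=0.2):
    """Proportion of least-squares residuals larger than cutoff in absolute value."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean(np.abs(y - X @ beta_hat) > cutoff)

def predictive_outlier_proportions(X, y, n_sims=1000, cutoff=0.2, rng=None):
    """Posterior predictive simulations of the outlier proportion."""
    rng = np.random.default_rng() if rng is None else rng
    T_rep = np.empty(n_sims)
    for l in range(n_sims):
        beta, sigma = draw_posterior(X, y, rng)                 # step i
        y_rep = X @ beta + sigma * rng.standard_normal(len(y))  # step ii
        T_rep[l] = outlier_proportion(X, y_rep, cutoff)         # steps iii and (b)
    return T_rep

# Hypothetical data with a small fraction of unmodeled shocks.
rng = np.random.default_rng(6)
n = 3000
X = np.column_stack([np.ones(n),
                     rng.binomial(1, 0.85, n),
                     rng.uniform(0.4, 0.8, n)])
y = X @ np.array([0.15, 0.10, 0.60]) + rng.normal(0, 0.05, n)
shocked = rng.choice(n, size=30, replace=False)
y[shocked] -= 0.3

observed = outlier_proportion(X, y)
T_rep = predictive_outlier_proportions(X, y, n_sims=1000, rng=rng)
print("observed proportion:", observed)
print("predictive 2.5%, median, 97.5%:", np.percentile(T_rep, [2.5, 50, 97.5]))
```

As in Table 14.2, the check compares the observed test statistic with quantiles of its replicated values; an observed value far in the tail indicates that the normal error model does not capture this aspect of the data.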

One way to measure the seriousness of the outliers is to compute a test statistic mea- 
suring the effect on political predictions. For example, consider the number of party 
switches—districts in which the incumbent party candidate loses. In the actual data, 
1498/11899 = 0.126 of contested district elections result in a party switch in that district. 
By comparison, we can compute the posterior predictive distribution of the proportion of 
party switches using the same posterior predictive simulations as above; the median of the 
1000 simulations is 0.136 with a central 95% posterior interval of [0.130, 0.143]. The pos- 
terior predictive simulations tell us that the observed proportion of party switches is lower 
than could be predicted under the model, but the difference is a minor one of overpredicting 
switches by about one percentage point. 


Sensitivity of results to the normality assumption. These outliers do not strongly affect 
our inference for the average incumbency advantage, so we ignore this failing of the model. 
We would not want to use this model for predictions of extreme outcomes, however, nor 
would we be surprised by occasional election outcomes far from the regression line. In 
political terms, the outliers may correspond to previously popular politicians who have been 
tainted by scandal—information that is not included in the model. It would be possible 
to generalize, for example by modeling the outliers with unequal variances for incumbents 
and open seats, along with a t error term; but we would want a good reason to add in this 
additional complexity. 


14.4 Goals of regression analysis 


Regression models satisfy at least three goals: (1) understanding the behavior of y, given 
x (for example, ‘what are the factors that aid a linear prediction of the Democratic share 
of the vote in a congressional district?’); (2) predicting y, given x, for future observations 
(‘what share of the vote might my local congressmember receive in the next election?’ or 
‘how many Democrats will be elected to Congress next year?’); (3) causal inference, or 
predicting how y would change if x were changed in a specified way (‘what would be the 
effect on the number of Democrats elected to Congress next year, if a term limitations bill 
were enacted—so no incumbents would be allowed to run for reelection—compared to if no 
term limitations bill were enacted?’). 

The goal of understanding how y varies as a function of x is clear, given any particular 
regression model. We discuss the goals of prediction and causal inference in more detail, 
focusing on how the general concepts can be implemented in the form of probability models, 
posterior inference, and prediction. 


Predicting y from x for new observations 


Once its parameters have been estimated, a regression can be used to predict future ob- 
servations from units in which the explanatory variables X, but not the outcome y, have 
been observed. When making predictions we are assuming that the old and new observations 
are exchangeable given the same values of x, so that the vector x contains all 
the information we have to distinguish the new observation from the old (this includes, for 
example, the assumption that time of observation is irrelevant if it is not encoded in x). 
For example, suppose we have fitted a regression model to 100 schoolchildren, with y for 
each child being the reading test score at the end of the second grade and x having two 
components: the student’s test score at the beginning of the second grade and a constant 
term. Then we could use these predictors to construct a predictive distribution of y for a 
new student for whom we have observed x. This prediction would be most trustworthy if 
all 101 students were randomly sampled from a common population (such as students in a 
particular school or students in the United States as a whole) and less reliable if the addi- 
tional student differed in some known way from the first hundred—for example, if the first 
hundred came from a single school and the additional student attended a different school. 
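
A minimal sketch of this predictive calculation, assuming the noninformative prior on (β, log σ) and entirely hypothetical test scores, is given here. Each predictive draw combines a posterior draw of (β, σ) with an independent normal error for the new student, so the predictive interval reflects both parameter uncertainty and individual variation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data for 100 students: x = (constant, fall score), y = spring score.
n = 100
fall = rng.normal(50, 10, n)
X = np.column_stack([np.ones(n), fall])
y = 10 + 0.9 * fall + rng.normal(0, 5, n)

k = X.shape[1]
V_beta = np.linalg.inv(X.T @ X)
beta_hat = V_beta @ X.T @ y
s2 = np.sum((y - X @ beta_hat) ** 2) / (n - k)

# Posterior draws of (beta, sigma) under p(beta, log sigma) proportional to 1.
n_sims = 4000
sigma2 = (n - k) * s2 / rng.chisquare(n - k, n_sims)
z = rng.standard_normal((n_sims, k))
beta = beta_hat + np.sqrt(sigma2)[:, None] * (z @ np.linalg.cholesky(V_beta).T)

# Predictive draws for a new student with an observed fall score of 60.
x_new = np.array([1.0, 60.0])
y_new = beta @ x_new + np.sqrt(sigma2) * rng.standard_normal(n_sims)

print("predictive median:", np.round(np.median(y_new), 1))
print("95% predictive interval:", np.round(np.percentile(y_new, [2.5, 97.5]), 1))
```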

As with exchangeability in general, it is not required that the 101 students be ‘identical’ 
or even similar, just that all relevant knowledge about them be included in x. The more 
similar the units are, the lower the variance of the regression will be, but that is an issue of 
precision, not validity. 

When the old and new observations are not exchangeable, the relevant information 
should be encoded in x. For example, if we are interested in learning about students from 
two different schools, we should include an indicator variable in the regression. The simplest 
approach is to replace the constant term in x by two indicator variables (that is, replacing 
the column of ones in X by two columns): x_A that equals 1 for students from school A and 
0 for students from school B, and x_B that is the reverse. Now, if all the 100 students used to 
estimate the regression attended school A, then the data will provide no evidence about the 
coefficient for school B. The resulting predictive distribution of y for a new student in school 
B is highly dependent on our prior distribution, indicating our uncertainty in extrapolating 
to a new population. (With a noninformative uniform prior distribution and no data on the 
coefficient of school B, the improper posterior predictive distribution for a new observation 
in school B will have infinite variance. In a real study, it should be possible to construct 
some sort of weak prior distribution linking the coefficients in the two schools, which would 
lead to a posterior distribution with high variance.) 


Causal inference 


The goals of describing the relationship between y and x and using the resulting model for 
prediction are straightforward applications of estimating p(y|x). Causal inference is more 
subtle. When thinking about causal inference, as in the incumbency advantage example of 
Section 14.3, we think of the variable of interest as the treatment variable and the other ex- 
planatory variables as control variables or covariates. In epidemiology, closely related terms 
are exposure and confounding variables, respectively. The treatment variables represent at- 
tributes that are manipulated or at least potentially manipulable by the investigator (such 
as the doses of drugs applied to a patient in an experimental medical treatment), whereas 
the control variables measure other characteristics of the experimental unit or experimental 
environment, such as the patient’s weight, measured before the treatment. 


Do not control for post-treatment variables when estimating the causal effect. 


Some care must be taken when considering control variables for causal inference. For in- 
stance, in the incumbency advantage example, what if we were to include a control variable 
for campaign spending, perhaps the logarithm of the number of dollars spent by the incum- 
bent candidate’s party in the election? After all, campaign spending is generally believed 
to have a large effect on many election outcomes. For the purposes of predicting election 
outcomes, it would be a good idea to include campaign spending as an explanatory variable. 
For the purpose of estimating the incumbency advantage with a regression, however, total 
campaign spending should not be included, because much spending occurs after the deci- 
sion of the incumbent whether to run for reelection. The causal effect of incumbency, as we 
have defined it, is not equivalent to the effect of the incumbent running versus not running, 
with total campaign spending held constant, since, if the incumbent runs, total campaign 
spending by the incumbent party will probably increase. Controlling for ‘pre-decision’ cam- 
paign spending would be legitimate, however. If we control for one of the effects of the 
treatment variable, our regression will probably underestimate the true causal effect. If we 
are interested in both predicting vote share and estimating the causal effect of incumbency, 
we could include campaign spending and vote share as correlated outcome variables. 


14.5 Assembling the matrix of explanatory variables 


The choice of which variables to include in a regression model depends on the purpose 
of the study. We discuss, from a Bayesian perspective, some issues that arise in classical 
regression. We have already discussed issues arising from the distinction between prediction 
and causal inference. 


Identifiability and collinearity 


The parameters in a classical regression cannot be uniquely estimated if there are more pa- 
rameters than data points or, more generally, if the columns of the matrix X of explanatory 
variables are not linearly independent. In these cases, the data are said to be ‘collinear,’ 
and β cannot be uniquely estimated from the data alone, no matter how large the sample 
size. Think of a k-dimensional scatterplot of the n data points: if the n points fall in a 
lower-dimensional subspace (such as a two-dimensional plane sitting in a three-dimensional 
space), then the data are collinear (‘coplanar’ would be a better word). If the data are 
nearly collinear, falling close to some lower-dimensional hyperplane, then they supply little 
information about some linear combinations of the β’s. 


For example, consider the incumbency regression described in Section 14.3. If all the 
incumbents running for reelection had won with 70% of the vote in the previous election, 
and all the open seats occurred in districts in which the incumbent party won 60% of the 
vote in the previous election, then the three variables, incumbency, previous vote for the 
incumbent party, and the constant term, would be collinear (previous vote = 0.6 + 0.1 R_i), 
and it would be impossible to estimate the three coefficients from the data alone. To do 
better we need more data—or prior information—that do not fall along the plane. Now 
consider a hypothetical dataset that is nearly collinear: suppose all the candidates who had 
received more than 65% in the previous election always ran for reelection, whereas members 
who had won less than 65% always declined to run. The near-collinearity of the data means 
that the posterior variance of the regression coefficients would be high in this hypothetical 
case. Another problem in addition to increased uncertainty conditional on the regression 
model is that in practice the inferences would be highly sensitive to the model’s assumption 
that E(y|x, β) is linear in x. 
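
Both hypothetical situations can be checked numerically. The sketch below, using simulated indicator and vote-share columns of our own construction, shows that the exactly collinear design has rank 2 rather than 3, and that in the nearly collinear design the factor multiplying σ in the conditional posterior standard deviation of the incumbency coefficient is inflated relative to a design in which incumbency is unrelated to the previous vote.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
ones = np.ones(n)

# Exactly collinear case: previous vote = 0.6 + 0.1 * R for every district.
R = rng.binomial(1, 0.8, n)
prev_exact = 0.6 + 0.1 * R
X_exact = np.column_stack([ones, R, prev_exact])
print("rank of exactly collinear X:", np.linalg.matrix_rank(X_exact))

# Nearly collinear case: R is determined by whether the previous vote exceeds 0.65.
prev = rng.uniform(0.5, 0.8, n)
R_cut = (prev > 0.65).astype(float)
X_near = np.column_stack([ones, R_cut, prev])

# Comparison design: the same R values, shuffled so they are unrelated to prev.
R_indep = rng.permutation(R_cut)
X_indep = np.column_stack([ones, R_indep, prev])

def coef_sd_factor(X, j):
    """Square root of the j-th diagonal element of (X'X)^{-1}.

    Given sigma, the posterior sd of beta_j under the noninformative
    prior is sigma times this factor.
    """
    return np.sqrt(np.linalg.inv(X.T @ X)[j, j])

print("sd factor, nearly collinear:", coef_sd_factor(X_near, 1))
print("sd factor, independent R:   ", coef_sd_factor(X_indep, 1))
```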


Nonlinear relations 


Once the variables have been selected, it often makes sense to transform them so that 
the relation between x and y is close to linear. Transformations such as logarithms and 
logits have been found useful in a variety of applications. One must take care, however: 
a transformation changes the interpretation of the regression coefficient to the change in 
transformed y per unit change in the transformed x variable. If it is thought that a variable 
x_j has a nonlinear effect on y, it is also possible to include more than one transformation of 
x_j in the regression; for example, including both x_j and x_j² allows an arbitrary quadratic 
function to be fitted. In Chapters 20-23 we discuss nonparametric models that can track 
nonlinear relations without need to prespecify the functional form of the nonlinearity. When 
y is discrete, a generalized linear model can be appropriate; see Chapter 16. 


Indicator variables 


To include a categorical variable in a regression, a natural approach is to construct an 
‘indicator variable’ for each category. This allows a separate effect for each level of the 
category, without assuming any ordering or other structure on the categories. When there 
are two categories, a simple 0/1 indicator works; when there are k categories, k − 1 indicators 
are required in addition to the constant term. It is often useful to incorporate the coefficients 
of indicator variables into hierarchical models, as we discuss in Chapter 15. 
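
As a small illustration, the following sketch builds indicator columns for a hypothetical three-level categorical variable (the variable name and its levels are invented) and confirms that the constant term plus k − 1 indicators has full column rank, whereas the constant term plus all k indicators is collinear. Treating the first level as the baseline is an arbitrary choice made here for convenience.

```python
import numpy as np

# Hypothetical categorical variable with k = 3 levels.
region = np.array(["north", "south", "west", "south", "north", "west", "west"])
levels = np.unique(region)  # sorted level names

# One 0/1 column per level.
full = (region[:, None] == levels[None, :]).astype(float)

ones = np.ones(len(region))
# Constant term plus k - 1 indicators; the first level serves as the baseline.
X = np.column_stack([ones, full[:, 1:]])
# Constant term plus all k indicators: the indicators sum to the constant column.
X_redundant = np.column_stack([ones, full])

print("columns:", X.shape[1], "rank:", np.linalg.matrix_rank(X))
print("columns:", X_redundant.shape[1], "rank:", np.linalg.matrix_rank(X_redundant))
```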


Categorical and continuous variables 


If there is a natural order to the categories of a discrete variable, then it is often useful to 
treat the variable as if it were continuous. For example, the letter grades A, B, C, D might 
be coded as 4, 3, 2, 1. In epidemiology this approach is often referred to as trend analysis. 
It is also possible to create a categorical variable from a continuous variable by grouping 
the values. This is sometimes helpful for examining and modeling departures from linearity 
in the relationship of y to a particular component of x. 


Interactions 


In the linear model, a change of one unit in x_i is associated with a constant change in the 
mean response of y_i, given any fixed values of the other predictors. If the response to a 
unit change in x_i depends on what value another predictor x_j has been fixed at, then it 
is necessary to include interaction terms in the model. Generally the interaction can be 
allowed for by adding the cross-product term (x_i − x̄_i)(x_j − x̄_j) as an additional predictor, 
although such terms may not be readily interpretable if both x_i and x_j are continuous 
(if such is the case, it is often preferable to categorize at least one of the two variables). 
For purposes of this exposition we treat these interactions just as we would any other 
explanatory variable: that is, create a new column of X and estimate a new element of β. 
We discuss nonparametric models for interactions in Chapters 20-23. 
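
The construction of the interaction column can be sketched as follows, using simulated predictors and an invented outcome model in which the effect of one predictor depends on the other. The helper name add_interaction is ours.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(2.0, 1.0, n)
x2 = rng.normal(-1.0, 2.0, n)
# Hypothetical outcome in which the effect of x1 depends on the value of x2.
y = (1.0 + 0.5 * x1 - 0.3 * x2
     + 0.2 * (x1 - 2.0) * (x2 + 1.0)
     + rng.normal(0, 0.5, n))

def add_interaction(x_i, x_j):
    """Centered cross-product term (x_i - mean(x_i)) * (x_j - mean(x_j))."""
    return (x_i - x_i.mean()) * (x_j - x_j.mean())

# The interaction enters as one more column of X, with one more coefficient.
X = np.column_stack([np.ones(n), x1, x2, add_interaction(x1, x2)])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))
```

Centering the two predictors before multiplying leaves the interaction coefficient unchanged but lets the coefficients of x1 and x2 be read as effects at the average value of the other predictor.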


Controlling for irrelevant variables 


In addition, we generally wish to include only variables that have some reasonable substan- 
tive connection to the problem under study. Often in regression there are a large number 
of potential control variables, some of which may appear to have predictive value. For one 
example, consider, as a possible control variable in the incumbency example, the number of 
letters in the last name of the incumbent party’s candidate. On the face of it, this variable 
looks silly, but it might happen to have predictive value for our dataset. In almost all cases, 
length of name is determined before the decision to seek reelection, so it will not interfere 
with causal inference. However, if length of name has predictive value in the regression, 
we should try to understand what is happening, rather than blindly throwing it into the 
model. For example, if length of name is correlating with ethnic group, which has politi- 
cal implications, it would be better, if possible, to use the more substantively meaningful 
variable, ethnicity itself, in the final model. 


Selecting the explanatory variables 


Ideally, a statistical model should include all relevant information; in a regression, x should 
include all covariates that might possibly help predict y. The attempt to include relevant 
predictors is demanding in practice but is generally worth the effort. The possible loss of 
precision when including unimportant predictors is usually viewed as a relatively small price 
to pay for the general validity of predictions and inferences about estimands of interest. 

In classical regression, there are direct disadvantages to increasing the number of ex- 
planatory variables. For one thing there is the restriction, when using the noninformative 
prior distribution, that k < n. In addition, using a large number of explanatory variables 
leaves little information available to obtain precise estimates of the variance. These prob- 
lems, which are sometimes summarized by the label ‘overfitting,’ are of much less concern 
with reasonable prior distributions, such as those applied in hierarchical linear models, as 
we shall see in Chapter 15. We consider Bayesian approaches to handling a large number 
of predictor variables in Section 14.6. 


14.6 Regularization and dimension reduction 


Approaches such as stepwise regression and subset selection are traditional non-Bayesian 
methods for choosing a set of explanatory variables to include in a regression. Mathemat- 
ically, not including a variable is equivalent to setting its coefficient to exactly zero. The 
classical regression estimate based on a selection procedure is equivalent to obtaining the 
posterior mode corresponding to a prior distribution that has nonzero probability on various 
low-dimensional hyperplanes of β-space. The selection procedure is heavily influenced by 
the quantity of data available, so that important variables may be omitted because chance 
variation cannot be ruled out as an alternative explanation for their predictive power. 
Geometrically, if β-space is thought of as a room, the model implied by classical model selection 
claims that the true β has certain prior probabilities of being in the room, on the floor, on 
the walls, in the edge of the room, or in a corner. 

In a Bayesian framework, it is appealing to include prior information more continuously. 
With many explanatory variables x_j, each with a fair probability of being irrelevant to 
modeling the outcome variable y, one can give each coefficient a prior distribution p(β) 
with a peak at zero—here we are thinking not of a spike at zero but of a distribution such 
as a zero-centered t with a mode concentrated near zero—and a long tail. This says that 
each variable is probably unimportant, but if it has predictive power, it could be large. 

For example, when the coefficients of a set of predictors are themselves modeled, it may 
be important first to apply linear or other transformations to put the predictors on an ap- 
proximately common scale. In classical maximum likelihood or least squares estimation, or 
if a linear regression is performed with a noninformative uniform prior distribution on the 
coefficients β_j, linear transformations of the predictor variables have no effect on inferences 
or predictions. In a Bayesian setting, however, transformations—even linear transforma- 
tions in a linear model—can be important. 

For example, if several different economic measures were available in a forecasting problem, 
it would be natural to first transform each to an approximate −1 to 1 scale (say) 
before applying a prior distribution on the coefficients. Such transformations make intuitive 
sense statistically, and they allow a common prior distribution to be reasonable for a set 
of different coefficients β_j. From a purely subjectivist perspective, any such linear data 
transformation would be irrelevant because it would directly map into a transformation 
of the prior distribution. In practical Bayesian inference, however, one of our criteria for 
choosing models is convenience, and as a default it is convenient to use a single prior dis- 
tribution for all the coefficients in a model—or perhaps to categorize the coefficients into 
two or three categories (for example, corresponding to predictors that are essential, those 
that are potentially important, and those of doubtful relevance). In either case, we would 
not be carefully assigning a separate prior distribution for each coefficient; rather, we would 
be using a flexible yet convenient class of models that we hope would perform well in most 
settings (the traditional goal of statistical methods, Bayesian and otherwise). 

As discussed in Section 2.8, regularization is a general term used for statistical procedures 
that give more stable estimates. Least-squares estimates with large numbers of predictors 
can be noisy; informative prior distributions regularize the estimates. Three choices are 
involved in Bayesian regularization: 


• The location and scale of the prior distribution: a more concentrated prior does more 
regularization. 


• The analytic form of the prior distribution: a normal distribution pulls estimates toward 
the prior mean by a constant proportion, a double-exponential (Laplacian) shifts estimates 
by a constant amount, and a long-tailed distribution such as a Cauchy does more 
regularization for estimates near the mean and less for estimates that are far away. 


• How the posterior inference is summarized: the posterior mode, which omits some variation, 
can look smoother (that is, more regularized) than the full posterior. 
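
The contrast between the normal and double-exponential forms can be seen in the simplest possible setting, which we assume here purely for illustration: a single estimate y ~ N(θ, σ²) with a prior distribution on θ centered at zero. With a N(0, τ²) prior the posterior mode is y multiplied by τ²/(τ² + σ²); with a prior density proportional to exp(−λ|θ|) the posterior mode is y moved toward zero by λσ², or set to zero if |y| is smaller than that amount. The Cauchy case has no comparably simple form and is not shown.

```python
import numpy as np

def normal_prior_mode(y, sigma, tau):
    """Posterior mode of theta for y ~ N(theta, sigma^2), theta ~ N(0, tau^2)."""
    return y * tau**2 / (tau**2 + sigma**2)

def laplace_prior_mode(y, sigma, lam):
    """Posterior mode of theta for y ~ N(theta, sigma^2),
    with prior density proportional to exp(-lam * |theta|)."""
    return np.sign(y) * np.maximum(np.abs(y) - lam * sigma**2, 0.0)

y = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
# Normal prior: every estimate is shrunk by the same proportion.
print(normal_prior_mode(y, sigma=1.0, tau=1.0))
# Double-exponential prior: every estimate is shifted by the same amount,
# and estimates within that amount of zero become exactly zero.
print(laplace_prior_mode(y, sigma=1.0, lam=1.0))
```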


Lasso 


We can illustrate all three of these points with lasso, a popular non-Bayesian form of 
regularization that corresponds to estimating coefficients by their posterior mode, after 
assigning a double-exponential (Laplace) prior distribution centered at zero; that is, 
p(β) ∝ ∏_j exp(−λ|β_j|). Combined with the normal likelihood from the regression model, this 
yields a posterior distribution that is partially pooled toward zero, with the amount of 
pooling determined by λ (a hyperparameter that can be set based on external information 
or estimated from data, either Bayesianly or through some non-Bayesian approach such as 
cross-validation). The key to lasso, though, is the combination of a prior distribution with 
a sharp peak right at zero and the decision to summarize the posterior distribution (or, 
equivalently, the penalized likelihood function) by its mode. Putting these together allows 
the estimates for coefficients to become exactly zero in settings with many predictors, noisy 
data, and small to moderate sample sizes. 
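To make the role of the mode concrete, here is a minimal sketch of the lasso estimate computed as a posterior mode by coordinate descent. It assumes a known residual standard deviation `sigma` and a fixed `lam`; the function names and the simulated data are hypothetical, and coordinate descent is one common way to find the mode, not necessarily the algorithm used in any particular lasso software.

```python
import numpy as np

def soft_threshold(z, t):
    """Move z toward zero by t, returning exactly zero if |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_mode(X, y, lam, sigma=1.0, n_iter=200):
    """Posterior mode of beta for y ~ N(X beta, sigma^2 I) with prior
    p(beta) proportional to prod_j exp(-lam * |beta_j|)."""
    n, k = X.shape
    beta = np.zeros(k)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(k):
            # Residual with the current contribution of predictor j added back.
            r_j = y - X @ beta + X[:, j] * beta[j]
            # One-dimensional mode: soft-threshold the least-squares update.
            beta[j] = soft_threshold(X[:, j] @ r_j, lam * sigma**2) / col_ss[j]
    return beta

# Hypothetical example: 10 predictors, only the first two matter.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = X @ np.array([3.0, -2.0] + [0.0] * 8) + rng.standard_normal(50)
beta_mode = lasso_mode(X, y, lam=20.0)
print(np.round(beta_mode, 2), "exact zeros:", int(np.sum(beta_mode == 0.0)))
```

The exact zeros come from the soft-thresholding step, which is a property of the mode; posterior means or simulation draws under the same prior would not be exactly zero.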

For the regressions we typically see, we do not believe any coefficients to be truly zero 
and we do not generally consider it a conceptual (as opposed to computational) advantage to 
get point estimates of zero—but regularized estimates such as obtained by lasso can be much 
better than those resulting from simple least squares and flat prior distributions, which, in 
practice, can implicitly require users to massively restrict the set of possible predictors in 
order to obtain stable estimates. 

From this perspective, regularization will always happen, one way or another—and sim- 
ple methods such as lasso have two advantages over traditional approaches to selecting 
regression predictors. First, lasso is clearly defined and algorithmic; its choices are trans- 
parent, which is not the case with informal methods. Second, lasso allows the inclusion 
of more information in the model fitting and uses a data-based approach to decide which 
predictors to keep. 

From a Bayesian perspective, there is room for improving lasso. The obvious first step 
would be to allow uncertainty in which variables are selected, perhaps returning a set of 
simulations representing different subsets to include in the model. We do not go this route, 
however, because, as noted above, we are not comfortable with an underlying model in which 
the coefficients can be exactly zero. Another approach would just take the lasso prior and do 
full Bayesian inference, in which case the coefficients are partially pooled. This makes some 
sense but in that case there is no particular advantage to the double-exponential family and 
we might as well use something from the t family instead. In addition, in any particular 
application we might have information suggesting that some coefficients should be pooled 
more than others—there is no need to use the same prior distribution on all of them. 

As always, once we think seriously about the model there are potentially endless com- 
plications. For all its simplifications, lasso is a useful step in the right direction, allowing 
the automatic use of many more predictors than would be possible under least squares. 


14.7 Unequal variances and correlations 


The data distribution (14.1) makes several assumptions—linearity of the expected value, 
$E(y|\theta, X)$, as a function of $X$, normality of the error terms, and independent observations
with equal variance—none of which is true in common practice. As always, the question 
is: does the gap between theory and practice adversely affect our inferences? Following the 
methods of Chapters 6 and 7, we can check posterior estimates and predictions to see how 
well the model fits relevant aspects of the data, and we can fit competing models to the 
same dataset to see how sensitive the inferences are to assumptions. 

In the regression context, we could try to reduce nonlinearity of $E(y|\theta, X)$ by including
explanatory variables, transformed appropriately. Nonlinearity can be diagnosed by plots of 
residuals against explanatory variables. If there is some concern about the proper relation 
between y and X, try several regressions: fitting y to various transformations of X. For 
example, in medicine, the degree of improvement of a patient may depend on the age of the 
patient. It is common for this relationship to be nonlinear, perhaps increasing for younger 
patients and then decreasing for older patients. Introduction of a nonlinear term such as a 
quadratic may improve the fit of the model. 
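As a small illustration of diagnosing nonlinearity from residuals, the sketch below fits a linear and then a quadratic regression to made-up data in which improvement rises and then falls with age. The data values and the helper name `fit_and_residuals` are hypothetical.

```python
import numpy as np

# Hypothetical data: improvement rises with age and then falls.
age = np.array([20.0, 30.0, 40.0, 50.0, 60.0, 70.0])
improvement = np.array([2.0, 3.5, 4.0, 3.5, 2.0, -0.5])

def fit_and_residuals(X, y):
    """Least-squares coefficients and residuals for predictor matrix X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat, y - X @ beta_hat

age_c = (age - age.mean()) / 10.0          # centered and rescaled for stability
ones = np.ones_like(age_c)
_, resid_linear = fit_and_residuals(np.column_stack([ones, age_c]), improvement)
_, resid_quad = fit_and_residuals(np.column_stack([ones, age_c, age_c**2]), improvement)

print("linear residuals:   ", np.round(resid_linear, 2))
print("quadratic residuals:", np.round(resid_quad, 2))
```

A systematic pattern in the linear-model residuals when ordered by age (same sign at both ends, opposite sign in the middle) is the kind of signal that suggests adding the quadratic term.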

Nonnormality is sometimes apparent for structural reasons (for example, when y only
takes on discrete values) and can also be diagnosed by residual plots, as in Figure 14.3 for
the incumbency example. If nonnormality is a serious concern, transformation is the first 
line of attack, or a generalized linear model may be appropriate; see Chapter 16 for details. 

Unequal variances of the regression errors, $y_i - X_i\beta$, can sometimes be detected by plots
of absolute residuals versus explanatory variables. Often the solution is just to include 
more explanatory variables. For example, a regression of agricultural yield in a number of 
geographic areas, on various factors concerning the soil and fertilizer, may appear to have 
unequal variances because local precipitation was not included as an explanatory variable. 
In other cases, unequal variances are a natural part of the data collection process. For 
example, if the sampling units are hospitals, and each data point is obtained as an average 
of patients within a hospital, then the variance is expected to be roughly proportional to 
the reciprocal of the sample size in each hospital. For another example, data collected by 
two different technicians of different proficiency will presumably exhibit unequal variances. 
We discuss models with more than one variance parameter in Section 14.7. 

Correlations between $(y_i - X_i\beta)$ and $(y_j - X_j\beta)$ (conditional on $X$ and the model
parameters) can sometimes be detected by examining the correlation of residuals with respect
to the possible cause of the problem. For example, if sampling units are collected sequen- 
tially in time, then the autocorrelation of the residual sequence should be examined and, if 
necessary, modeled. The usual linear model is not appropriate because the time informa- 
tion was not explicitly included in the model. If correlation exists in the data but is not 
included in the model, then the posterior inference about model parameters will typically 
be falsely precise, because the n sampling units will contain less information than n inde- 
pendent sampling units. In addition, predictions for future data will be inaccurate if they 
ignore correlation between relevant observed units. For example, heights of siblings remain 
correlated even after controlling for age and sex. But if we also control for genetically re- 
lated variables such as the heights of the two parents (that is, add two more columns to 
the X matrix), the siblings’ heights will have a lower (but in general still positive) correla- 
tion. This example suggests that it may often be possible to use more explanatory variables 
to reduce the complexity of the covariance matrix and thereby use more straightforward 
analyses. However, nonzero correlations will always be required when the values of y for 
particular subjects under study are related to each other through mechanisms other than 
systematic dependence on observed covariates. 


Modeling unequal variances and correlated errors 


Unequal variances and correlated errors can be included in the linear model by allowing a
data covariance matrix $\Sigma_y$ that is not necessarily proportional to the identity matrix:

$$y \sim \mathrm{N}(X\beta, \Sigma_y). \tag{14.11}$$

Modeling and estimation are, in general, more difficult than in ordinary linear regression.
The symmetric, positive definite $n \times n$ data variance matrix $\Sigma_y$ must be specified or given
an informative prior distribution.


Bayesian regression with a known covariance matrix 


We first consider the simplest case of unequal variances and correlated errors, where the
variance matrix $\Sigma_y$ is known. We continue to assume a noninformative uniform prior
distribution for $\beta$. The computations in this section will be useful later as an intermediate step
in iterative computations for more complicated models with informative prior information
and hierarchical structures.

The posterior distribution. The posterior distribution is nearly identical to that for
ordinary linear regression with known variance, if we apply a simple linear transformation to
$X$ and $y$. Let $\Sigma_y^{1/2}$ be a Cholesky factor (an upper triangular ‘matrix square root’) of $\Sigma_y$.
Multiplying both sides of the regression equation (14.11) by $\Sigma_y^{-1/2}$ yields

$$\Sigma_y^{-1/2} y \mid \theta, X \sim \mathrm{N}(\Sigma_y^{-1/2} X\beta, I).$$


Drawing posterior simulations. To draw samples from the posterior distribution with data
variance $\Sigma_y$ known, first compute the Cholesky factor $\Sigma_y^{1/2}$ and its inverse, $\Sigma_y^{-1/2}$, then
repeat the procedure of Section 14.2 with $\sigma$ fixed at 1, replacing $y$ by $\Sigma_y^{-1/2}y$ and $X$ by
$\Sigma_y^{-1/2}X$. Algebraically, this means replacing (14.4) and (14.5) by

$$\hat{\beta} = (X^T\Sigma_y^{-1}X)^{-1}X^T\Sigma_y^{-1}y \tag{14.12}$$
$$V_\beta = (X^T\Sigma_y^{-1}X)^{-1}, \tag{14.13}$$

with the posterior distribution given by (14.3). As with (14.4) and (14.5), the matrix
inversions never actually need to be computed since the Cholesky decomposition should be
used for computation.
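The whitening procedure just described can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the function name is hypothetical, NumPy's Cholesky routine returns a lower triangular factor (so the code whitens with its inverse), and `np.linalg.solve` is used for brevity where a dedicated triangular solver would be more efficient.

```python
import numpy as np

def draw_beta_known_cov(X, y, Sigma_y, n_draws, rng):
    """Draw beta from N(beta_hat, V_beta) under a uniform prior on beta,
    with beta_hat and V_beta as in (14.12)-(14.13)."""
    L = np.linalg.cholesky(Sigma_y)        # lower triangular, Sigma_y = L L^T
    X_w = np.linalg.solve(L, X)            # whitened predictors, L^{-1} X
    y_w = np.linalg.solve(L, y)            # whitened outcomes,   L^{-1} y
    Q, R = np.linalg.qr(X_w)               # X_w = Q R, so X_w^T X_w = R^T R
    beta_hat = np.linalg.solve(R, Q.T @ y_w)
    # beta_hat + R^{-1} z has covariance R^{-1} R^{-T} = (X_w^T X_w)^{-1} = V_beta
    z = rng.standard_normal((X.shape[1], n_draws))
    draws = beta_hat[:, None] + np.linalg.solve(R, z)
    return beta_hat, draws.T

# Hypothetical example with three correlated observations.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 4.0])
Sigma_y = np.array([[1.0, 0.5, 0.0], [0.5, 2.0, 0.5], [0.0, 0.5, 4.0]])
beta_hat, beta_draws = draw_beta_known_cov(X, y, Sigma_y, 1000, np.random.default_rng(0))
print(beta_hat, beta_draws.mean(axis=0))
```

No matrix inverse is formed explicitly; the only factorizations are the Cholesky factor of the data covariance and the QR decomposition of the whitened predictors.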


Prediction. Suppose we wish to sample from the posterior predictive distribution of $\tilde{n}$ new
observations, $\tilde{y}$, given an $\tilde{n} \times k$ matrix of explanatory variables, $\tilde{X}$. Prediction with nonzero
correlations is more complicated than in ordinary linear regression because we must specify
the joint variance matrix for the old and new data. For example, consider a regression of
children’s heights in which the heights of children from the same family are correlated with
a fixed known correlation. If we wish to predict the height of a new child whose brother is in
the old dataset, we should use that correlation in the predictions. We will use the following
notation for the joint normal distribution of $y$ and $\tilde{y}$, given the explanatory variables and
the parameters of the regression model:

$$\begin{pmatrix} y \\ \tilde{y} \end{pmatrix} \Big|\, X, \tilde{X}, \theta \sim \mathrm{N}\left( \begin{pmatrix} X\beta \\ \tilde{X}\beta \end{pmatrix}, \begin{pmatrix} \Sigma_y & \Sigma_{y\tilde{y}} \\ \Sigma_{\tilde{y}y} & \Sigma_{\tilde{y}} \end{pmatrix} \right).$$

The covariance matrix for $(y, \tilde{y})$ must be symmetric and positive semidefinite.
Given $(y, \beta, \Sigma_y)$, the heights of the new children have a joint normal posterior predictive
distribution with mean and variance matrix,

$$\mathrm{E}(\tilde{y}|y,\beta,\Sigma_y) = \tilde{X}\beta + \Sigma_{\tilde{y}y}\Sigma_y^{-1}(y - X\beta)$$
$$\mathrm{var}(\tilde{y}|y,\beta,\Sigma_y) = \Sigma_{\tilde{y}} - \Sigma_{\tilde{y}y}\Sigma_y^{-1}\Sigma_{y\tilde{y}},$$

which can be derived from the properties of the multivariate normal distribution; see (3.13)
on page 72 and (A.1) on page 582.
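The conditional mean and variance can be computed directly from the blocks of the joint covariance matrix. The sketch below is a hypothetical helper (names and the sibling numbers are made up): one child already in the data is 2 cm taller than the regression predicts, heights have standard deviation 6 cm, and the assumed sibling correlation is 0.5.

```python
import numpy as np

def predictive_mean_var(y, X, X_new, beta, Sigma_y, Sigma_new, Sigma_new_y):
    """Mean and variance of y_new given y, beta, and the covariance blocks.
    Sigma_new_y is the (n_new x n) covariance between new and old data."""
    A = np.linalg.solve(Sigma_y, Sigma_new_y.T).T    # Sigma_new_y Sigma_y^{-1}
    mean = X_new @ beta + A @ (y - X @ beta)
    var = Sigma_new - A @ Sigma_new_y.T
    return mean, var

# Hypothetical sibling example (intercept-only regression, heights in cm).
beta = np.array([170.0])
X, y = np.array([[1.0]]), np.array([172.0])
X_new = np.array([[1.0]])
Sigma_y = np.array([[36.0]])
Sigma_new = np.array([[36.0]])
Sigma_new_y = np.array([[18.0]])                     # correlation 0.5
mean, var = predictive_mean_var(y, X, X_new, beta, Sigma_y, Sigma_new, Sigma_new_y)
print(mean, var)
```

Half of the brother's residual carries over to the predicted mean for the new child, and the predictive variance is smaller than the marginal variance of 36. For full posterior predictive simulations this calculation would be repeated for each posterior draw of the regression coefficients.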


Bayesian regression with unknown covariance matrix 


We now derive the posterior distribution when the covariance matrix is unknown. As usual,
we divide the problem of inference in two parts: posterior inference for $\beta$ conditional on
$\Sigma_y$ (which we have just considered) and posterior inference for $\Sigma_y$. Assume that the prior
distribution on $\beta$ is uniform, with fixed scaling not depending on $\Sigma_y$; that is, $p(\beta|\Sigma_y) \propto 1$.
Then the marginal posterior distribution of $\Sigma_y$ can be written as

$$p(\Sigma_y|y) = \frac{p(\beta,\Sigma_y|y)}{p(\beta|\Sigma_y,y)} \propto \frac{p(\Sigma_y)\,\mathrm{N}(y|X\beta,\Sigma_y)}{\mathrm{N}(\beta|\hat{\beta},V_\beta)}, \tag{14.14}$$




where $\hat{\beta}$ and $V_\beta$ depend on $\Sigma_y$ and are defined by (14.12)-(14.13). Expression (14.14)
must hold for any $\beta$ (since the left side of the equation does not depend on $\beta$ at all); for
convenience and computational stability we set $\beta = \hat{\beta}$:

$$p(\Sigma_y|y) \propto p(\Sigma_y)\,|\Sigma_y|^{-1/2}|V_\beta|^{1/2}\exp\left(-\frac{1}{2}(y - X\hat{\beta})^T\Sigma_y^{-1}(y - X\hat{\beta})\right). \tag{14.15}$$

Difficulties with the general parameterization. The density (14.15) is easy to compute but
hard to draw samples from in general, because of the dependence of $\hat{\beta}$ and $|V_\beta|^{1/2}$ on $\Sigma_y$.
Perhaps more important, setting up a prior distribution on $\Sigma_y$ is, in general, a difficult task.
In the next section we discuss several important special cases of parameterizations of the
variance matrix, focusing on models with unequal variances but zero correlations.
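To illustrate the sense in which the marginal posterior density of the data covariance matrix is easy to compute, the sketch below evaluates the logarithm of (14.15) up to an additive constant for any candidate covariance matrix. The function name is hypothetical, and the log prior density is passed in by the caller because the text does not fix a prior for the covariance matrix.

```python
import numpy as np

def log_marginal_post_Sigma(X, y, Sigma_y, log_prior=0.0):
    """log p(Sigma_y | y) up to an additive constant, following (14.15)."""
    L = np.linalg.cholesky(Sigma_y)                 # Sigma_y = L L^T
    X_w, y_w = np.linalg.solve(L, X), np.linalg.solve(L, y)
    XtX = X_w.T @ X_w                               # X^T Sigma_y^{-1} X = V_beta^{-1}
    beta_hat = np.linalg.solve(XtX, X_w.T @ y_w)
    resid = y_w - X_w @ beta_hat                    # whitened residuals at beta_hat
    log_det_Sigma = 2.0 * np.sum(np.log(np.diag(L)))
    log_det_V = -np.linalg.slogdet(XtX)[1]
    return log_prior - 0.5 * log_det_Sigma + 0.5 * log_det_V - 0.5 * resid @ resid

# Hypothetical check on a tiny intercept-only problem.
X = np.ones((3, 1))
y = np.array([1.0, 2.0, 3.0])
print(log_marginal_post_Sigma(X, y, np.eye(3)),
      log_marginal_post_Sigma(X, y, 4.0 * np.eye(3)))
```

Evaluating the density is a few lines; the difficulty noted in the text is that this function of the covariance matrix does not correspond to a distribution that can be sampled directly.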


Variance matrix known up to a scalar factor 


Suppose we can write the data variance $\Sigma_y$ as

$$\Sigma_y = Q_y\sigma^2, \tag{14.16}$$

where the matrix $Q_y$ is known but the scale $\sigma$ is unknown. As with ordinary linear
regression, we start by assuming a noninformative prior distribution, $p(\beta,\sigma^2) \propto \sigma^{-2}$.

To draw samples from the posterior distribution of $(\beta,\sigma^2)$, one must now compute $Q_y^{-1/2}$
and then repeat the procedure of Section 14.2 with $\sigma^2$ unknown, replacing $y$ by $Q_y^{-1/2}y$ and
$X$ by $Q_y^{-1/2}X$. Algebraically, this means replacing (14.4) and (14.5) by

$$\hat{\beta} = (X^TQ_y^{-1}X)^{-1}X^TQ_y^{-1}y \tag{14.17}$$
$$V_\beta = (X^TQ_y^{-1}X)^{-1}, \tag{14.18}$$

and (14.7) by

$$s^2 = \frac{1}{n-k}(y - X\hat{\beta})^TQ_y^{-1}(y - X\hat{\beta}), \tag{14.19}$$

with the normal and scaled inverse-$\chi^2$ distributions (14.3) and (14.6). These formulas are
just generalizations of the results for ordinary linear regression (for which $Q_y = I$). As
with (14.4) and (14.5), the matrix inversions do not actually need to be computed since the
Cholesky decomposition should be used for computation.
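A minimal sketch of the two-step simulation follows: first draw the variance from its scaled inverse-chi-squared marginal posterior, then draw the coefficients from their conditional normal distribution. The function name and example data are hypothetical, and the explicit inverse of the small k-by-k matrix is used only to keep the sketch short.

```python
import numpy as np

def draw_sigma2_beta(X, y, Q_y, n_draws, rng):
    """Posterior draws of (sigma^2, beta) when Sigma_y = Q_y * sigma^2 and
    p(beta, sigma^2) is proportional to sigma^{-2}."""
    n, k = X.shape
    L = np.linalg.cholesky(Q_y)                    # Q_y = L L^T
    X_w, y_w = np.linalg.solve(L, X), np.linalg.solve(L, y)
    V_beta = np.linalg.inv(X_w.T @ X_w)            # (14.18); explicit inverse for brevity
    beta_hat = V_beta @ (X_w.T @ y_w)              # (14.17)
    resid = y_w - X_w @ beta_hat
    s2 = resid @ resid / (n - k)                   # (14.19)
    # sigma^2 | y ~ Inv-chi^2(n - k, s^2), drawn as (n - k) s^2 / chi^2_{n-k}
    sigma2 = (n - k) * s2 / rng.chisquare(n - k, size=n_draws)
    # beta | sigma^2, y ~ N(beta_hat, V_beta * sigma^2)
    C = np.linalg.cholesky(V_beta)
    z = rng.standard_normal((n_draws, k))
    beta = beta_hat + np.sqrt(sigma2)[:, None] * (z @ C.T)
    return beta_hat, V_beta, s2, sigma2, beta

# Hypothetical example; with Q_y = I this reduces to ordinary linear regression.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 1.0, 1.0, 3.0])
beta_hat, V_beta, s2, sigma2, beta = draw_sigma2_beta(
    X, y, np.eye(4), 1000, np.random.default_rng(1))
print(beta_hat, s2, np.median(sigma2))
```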


Weighted linear regression 


If the data variance matrix is diagonal, then the above model is called weighted linear
regression. We use the notation

$$\Sigma_{ii} = \sigma^2/w_i,$$

where $w_1,\ldots,w_n$ are known ‘weights,’ and $\sigma^2$ is an unknown variance parameter. Think of
$w$ as an additional $X$ variable that does not affect the mean of $y$ but does affect its variance.
The procedure for weighted linear regression is the same as for the general matrix version,
with the simplification that $Q_y^{-1} = \mathrm{diag}(w_1,\ldots,w_n)$.
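With diagonal weights the transformation amounts to multiplying each row of the data by the square root of its weight. The sketch below uses a hypothetical intercept-only example in which each data point is an average and the weights are the corresponding sample sizes; the function name and numbers are made up.

```python
import numpy as np

def weighted_regression(X, y, w):
    """beta_hat, V_beta, and s^2 when var(y_i) = sigma^2 / w_i,
    that is, when the inverse of Q_y is diag(w)."""
    sw = np.sqrt(w)
    X_w, y_w = X * sw[:, None], y * sw         # rows scaled by sqrt(w_i)
    beta_hat, *_ = np.linalg.lstsq(X_w, y_w, rcond=None)
    V_beta = np.linalg.inv(X_w.T @ X_w)
    resid = y_w - X_w @ beta_hat
    s2 = resid @ resid / (len(y) - X.shape[1])
    return beta_hat, V_beta, s2

# Hypothetical averages from three units with relative sample sizes 1, 1, 2.
X = np.ones((3, 1))
y = np.array([1.0, 2.0, 4.0])
w = np.array([1.0, 1.0, 2.0])
beta_hat, V_beta, s2 = weighted_regression(X, y, w)
print(beta_hat, V_beta, s2)
```

In this intercept-only case the estimate is simply the weighted mean of the observations, which gives a quick way to check the computation by hand.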


Parametric models for unequal variances 


More generally, the variances can depend nonlinearly on the inverse weights:

$$\Sigma_{ii} = \sigma^2 v(w_i,\phi), \tag{14.20}$$

where $\phi$ is an unknown parameter and $v$ is some function such as $v(w_i,\phi) = w_i^{-\phi}$. This
parameterization has the feature of continuously changing from equal variances at $\phi = 0$ to
variances proportional to $1/w_i$ when $\phi = 1$ and can thus be considered a generalization of
weighted linear regression. (Another simple functional form with this feature is $v(w_i,\phi) =
(1 - \phi) + \phi/w_i$.) A reasonable noninformative prior distribution for $\phi$ is uniform on [0, 1].

Before analysis, the weights $w_i$ are multiplied by a constant factor set so that their
product is 1, so that inference for $\phi$ will not be affected by the scaling of the weights. If this
adjustment is not done, the joint prior distribution of $\sigma$ and $\phi$ must be set up to account
for the scale of the weights (see the bibliographic note for more discussion and Exercise 6.5
for a similar example).


Drawing posterior simulations. The joint posterior distribution is

$$p(\beta,\sigma^2,\phi|y) \propto p(\phi)\,p(\beta,\sigma^2|\phi)\prod_{i=1}^n \mathrm{N}(y_i|X_i\beta,\sigma^2 v(w_i,\phi)). \tag{14.21}$$

Assuming the usual noninformative prior density, $p(\beta,\log\sigma|\phi) \propto 1$, we can factor the
posterior distribution, and draw simulations, as follows.
First, given $\phi$, the model is just weighted linear regression with

$$Q_y = \mathrm{diag}(v(w_1,\phi),\ldots,v(w_n,\phi)).$$

To perform the computation, just replace $X$ and $y$ by $Q_y^{-1/2}X$ and $Q_y^{-1/2}y$, respectively,
and follow the linear regression computations.
Second, the marginal posterior distribution of $\phi$ is

$$p(\phi|y) = \frac{p(\beta,\sigma^2,\phi|y)}{p(\beta,\sigma^2|\phi,y)} \propto \frac{p(\phi)\,\sigma^{-2}\prod_{i=1}^n \mathrm{N}(y_i|X_i\beta,\sigma^2 v(w_i,\phi))}{\text{Inv-}\chi^2(\sigma^2|n-k,s^2)\,\mathrm{N}(\beta|\hat{\beta},V_\beta\sigma^2)}.$$

This equation holds in general, so it must hold for any particular value of $(\beta,\sigma^2)$. For
analytical convenience and computational stability, we evaluate the expression at $(\hat{\beta},s^2)$.
Also, recall that the product of the weights is 1, so we now have

$$p(\phi|y) \propto p(\phi)\,|V_\beta|^{1/2}(s^2)^{-(n-k)/2}, \tag{14.22}$$

where $\hat{\beta}$, $V_\beta$, $s^2$ depend on $\phi$ and are given by (14.17)-(14.19). Expression (14.22) is not a
standard distribution, but for any specified weights $w$ and functional form $v(w_i,\phi)$, it can
be evaluated at a grid of $\phi$ in [0,1] to yield a numerical posterior density, $p(\phi|y)$.

It is then easy to draw joint posterior simulations in the order $\phi$, $\sigma^2$, $\beta$.
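The grid computation and the three-step simulation can be sketched as follows, assuming the power form in which the variance of observation i is proportional to the weight raised to the power minus phi, with a uniform prior on phi. All function names, the grid, and the simulated data are hypothetical.

```python
import numpy as np

def fit_given_phi(X, y, w, phi):
    """Quantities (14.17)-(14.19) for v(w_i, phi) = w_i ** (-phi)."""
    n, k = X.shape
    scale = w ** (phi / 2.0)                   # diagonal of the inverse square root of Q_y
    X_w, y_w = X * scale[:, None], y * scale
    XtX = X_w.T @ X_w
    beta_hat = np.linalg.solve(XtX, X_w.T @ y_w)
    resid = y_w - X_w @ beta_hat
    return beta_hat, np.linalg.inv(XtX), resid @ resid / (n - k)

def phi_grid_posterior(X, y, w, grid):
    """Normalized p(phi | y) on a grid, from (14.22) with a uniform prior."""
    n, k = X.shape
    w = w / np.exp(np.mean(np.log(w)))         # rescale so the product of weights is 1
    log_p = np.empty(len(grid))
    for g, phi in enumerate(grid):
        _, V_beta, s2 = fit_given_phi(X, y, w, phi)
        log_p[g] = 0.5 * np.linalg.slogdet(V_beta)[1] - 0.5 * (n - k) * np.log(s2)
    p = np.exp(log_p - log_p.max())            # subtract the max to avoid underflow
    return w, p / p.sum()

def draw_joint(X, y, w, grid, n_draws, rng):
    """Draw (phi, sigma^2, beta), in that order."""
    n, k = X.shape
    w, p = phi_grid_posterior(X, y, w, grid)
    draws = []
    for phi in rng.choice(grid, size=n_draws, p=p):
        beta_hat, V_beta, s2 = fit_given_phi(X, y, w, phi)
        sigma2 = (n - k) * s2 / rng.chisquare(n - k)
        beta = rng.multivariate_normal(beta_hat, V_beta * sigma2)
        draws.append((phi, sigma2, beta))
    return draws

# Hypothetical simulated data.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), np.arange(30.0)])
w = np.linspace(1.0, 5.0, 30)
y = X @ np.array([1.0, 0.5]) + rng.standard_normal(30) / np.sqrt(w)
grid = np.linspace(0.0, 1.0, 11)
draws = draw_joint(X, y, w, grid, 200, rng)
print(np.mean([d[0] for d in draws]))
```

Working on the log scale and subtracting the maximum before exponentiating keeps the grid computation stable when n is large.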


Estimating several unknown variance parameters 


Thus far we have considered regression models with a single unknown variance parameter, 
$\sigma$, allowing us to model data with equal variances and zero correlations (in ordinary linear
regression) or unequal variances with known variance ratios and correlations (in weighted 
linear regression). Models with several unknown variance parameters arise when different 
groups of observations have different variances and also, perhaps more importantly, when 
considering hierarchical regression models, as we discuss in Section 15.1. 

When there is more than one unknown variance parameter, there is no general method 
to sample directly from the marginal posterior distribution of the variance parameters, and 
we generally must resort to iterative simulation techniques. In this section, we discuss linear 
regression with unequal variances and a noninformative prior distribution; Section 14.8 and
Chapter 15 describe how to account for prior information. 

Many parametric models are possible for unequal variances; here, we discuss models in 
which the variance matrix $\Sigma_y$ is known up to a diagonal vector of variances. If the variance
matrix has unknown nondiagonal components, the computation is more difficult. 


Example. Estimating the incumbency advantage (continued) 

In the analysis described in Section 14.3, it is reasonable to suppose that congressional 
elections with incumbents running for reelection are less variable than open-seat elec- 
tions, because of the familiarity of the voters with the incumbent candidate. Love him 
or hate him, at least they know him, and so their votes should be predictable. Or 
maybe the other way around—when two unknowns are running, people vote based on 
their political parties, while the incumbency advantage is a wild card that helps some 
politicians more than others. In any case, incumbent and open-seat elections seem 
different, and we might try modeling them with two different variance parameters. 


Notation. Suppose the $n$ observations can be divided into $I$ batches ($n_1$ data points of
type 1, $n_2$ of type 2, ..., $n_I$ of type $I$), with each type of observation having its own
variance parameter to be estimated, so that we must estimate $I$ scalar variance parameters
$\sigma_1, \sigma_2, \ldots, \sigma_I$, instead of just $\sigma$. This model is characterized by expression (14.11) with
covariance matrix $\Sigma_y$ that is diagonal with $n_i$ instances of $\sigma_i^2$ for each $i = 1,\ldots,I$, and
$\sum_{i=1}^I n_i = n$. In the incumbency example, $I = 2$ (incumbents and open seats).


A noninformative prior distribution. To derive the natural noninformative prior
distribution for the variance components, think of the data as $I$ separate experiments, each with its
own unknown independent variance parameter. Multiplying the $I$ separate noninformative
prior distributions, along with a uniform prior distribution on the regression coefficients,
yields $p(\beta,\Sigma_y) \propto \prod_{i=1}^I \sigma_i^{-2}$. The posterior distribution of the variance $\sigma_i^2$ is proper only
if $n_i \geq 2$; if the $i$th batch comprises only one observation, its variance parameter $\sigma_i^2$ must
have an informative prior specification.

For the incumbency example, there are enough observations in each year so that the 
results based on a noninformative prior distribution for $(\beta, \sigma_1^2, \sigma_2^2)$ may be acceptable. We
follow our usual practice of performing a noninformative analysis and then examining the 
results to see where it might make sense to improve the model. 


Posterior distribution. The joint posterior density of $\beta$ and the variance parameters is

$$p(\beta,\sigma_1^2,\ldots,\sigma_I^2|y) \propto \left(\prod_{i=1}^I \sigma_i^{-n_i-2}\right)\exp\left(-\frac{1}{2}(y - X\beta)^T\Sigma_y^{-1}(y - X\beta)\right), \tag{14.23}$$

where the matrix $\Sigma_y$ itself depends on the variance parameters $\sigma_i^2$. The conditional
posterior distribution of $\beta$ given the variance parameters is just the weighted linear regression
result with known variance matrix, and the marginal posterior distribution of the variance
parameters is given by

$$p(\Sigma_y|y) \propto p(\Sigma_y)\,|V_\beta|^{1/2}|\Sigma_y|^{-1/2}\exp\left(-\frac{1}{2}(y - X\hat{\beta})^T\Sigma_y^{-1}(y - X\hat{\beta})\right)$$

(see (14.15)), with the understanding that $\Sigma_y$ is parameterized by the vector $(\sigma_1^2,\ldots,\sigma_I^2)$,
and with the prior density $p(\Sigma_y) \propto \prod_{i=1}^I \sigma_i^{-2}$.
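One iterative scheme for this model is a Gibbs sampler that alternates between the two conditional distributions implied by the joint density (14.23): the coefficients given the variances (a weighted regression), and each batch variance given the coefficients (a scaled inverse-chi-squared distribution with degrees of freedom equal to the batch size and scale equal to the mean squared residual in that batch). The sketch below is an illustration under those assumptions; the function name, starting values, and simulated data are hypothetical.

```python
import numpy as np

def gibbs_group_variances(X, y, group, n_iter, rng):
    """Gibbs sampler for y_i ~ N(X_i beta, sigma^2 of the batch of i), with
    prior density proportional to the product of the sigma_i^{-2}.
    `group` holds integer batch labels 0, 1, ..., I-1."""
    n, k = X.shape
    n_groups = group.max() + 1
    counts = np.bincount(group, minlength=n_groups)
    sigma2 = np.ones(n_groups)                       # arbitrary starting values
    beta_draws = np.empty((n_iter, k))
    sigma2_draws = np.empty((n_iter, n_groups))
    for t in range(n_iter):
        # beta | variances, y: weighted regression with weights 1 / sigma^2
        sw = 1.0 / np.sqrt(sigma2[group])
        X_w, y_w = X * sw[:, None], y * sw
        V_beta = np.linalg.inv(X_w.T @ X_w)
        beta = rng.multivariate_normal(V_beta @ (X_w.T @ y_w), V_beta)
        # sigma_i^2 | beta, y ~ Inv-chi^2(n_i, SS_i / n_i), drawn as SS_i / chi^2_{n_i}
        ss = np.bincount(group, weights=(y - X @ beta) ** 2, minlength=n_groups)
        sigma2 = ss / rng.chisquare(counts)
        beta_draws[t], sigma2_draws[t] = beta, sigma2
    return beta_draws, sigma2_draws

# Hypothetical data: two batches with residual standard deviations 1 and 3.
rng = np.random.default_rng(3)
n = 200
group = np.repeat([0, 1], 100)
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n) * np.where(group == 0, 1.0, 3.0)
beta_draws, sigma2_draws = gibbs_group_variances(X, y, group, 500, rng)
print(np.sqrt(np.median(sigma2_draws[250:], axis=0)))
```

In practice one would run several sequences from dispersed starting points and monitor convergence before summarizing the draws.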

Example. Estimating the incumbency advantage (continued)

Comparing variances by examining residuals. We can get a rough idea of what to
expect by examining the average residual standard deviations for the two kinds of
observations. In the post-1940 period, the residual standard deviations were, on average,
higher for open seats than contested elections. As a result, a model with equal
variances distorts estimates of $\beta$ somewhat because it does not ‘know’ to treat open-seat
outcomes as more variable than contested elections.

Figure 14.4 Posterior medians of standard deviations $\sigma_1$ and $\sigma_2$ for elections with incumbents (solid
line) and open-seat elections (dotted line), 1898-1990, estimated from the model with two variance
components. (These years are slightly different from those in Figure 14.2 because this model was
fit to a slightly different dataset.)


Fitting the regression model with two variance parameters. For each year, we fit the
model, $y \sim \mathrm{N}(X\beta, \Sigma_y)$, in which $\Sigma_y$ is diagonal with $\Sigma_{ii}$ equal to $\sigma_1^2$ for districts
with incumbents running and $\sigma_2^2$ for open seats. We used EM to find a marginal
posterior mode of $(\sigma_1^2, \sigma_2^2)$ and used the normal approximation about the mode as a
starting point for the Gibbs sampler; in this case, the normal approximation is accurate
(otherwise we might use a $t$ approximation instead), and three independent sequences,
each of length 100, were more than sufficient to bring the estimated potential scale
reduction, $\hat{R}$, to below 1.1 for all parameters.

The inference for the incumbency coefficients over time is virtually unchanged from the 
equal-variance model, and so we do not bother to display the results. The inferences 
for the two variance components are displayed in Figure 14.4. The variance estimate
for the ordinary linear regressions (not shown here) followed a pattern similar to the 
solid line in Figure 14.4, which makes sense considering that most of the elections have 
incumbents running (recall Figure 14.1). The most important difference between the 
two models is in the predictive distribution—the unequal-variance model realistically 
models the uncertainties in open-seat and incumbent elections since 1940. Further 
improvement could probably be made by pooling information across elections using a 
hierarchical model. 

Even the new model is subject to criticism. For example, the spiky time series pattern 
of the estimates of $\sigma_2$ does not look right; more smoothness would seem appropriate,
and the variability apparent in Figure 14.4 is due to the small number of open seats per 
year, especially in more recent years. A hierarchical time series model (which we do 
not cover in this book) would be an improvement on the current noninformative prior 
on the variances. In practice, one can visually smooth the estimates in the graph, but 
for the purpose of estimating the size of the real changes in the variance, a hierarchical 
model would be preferable.


General models for unequal variances


All the models we have considered so far follow the general form,

$$\mathrm{E}(y|X,\theta) = X\beta$$
$$\log(\mathrm{var}(y|X,\theta)) = W\phi,$$

where $W$ is a specified matrix of parameters governing the variance (log weights for weighted
linear regression, indicator variables for unequal variances in groups), and $\phi$ is the vector
of variance parameters. In the general form of the model, iterative simulation methods
including the Metropolis algorithm can be used to draw posterior simulations of $(\beta, \phi)$.
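As a rough sketch of how the Metropolis algorithm might be applied to the general form, the code below uses a random-walk proposal on the stacked vector of mean and log-variance coefficients. The flat prior on both sets of coefficients, the step size, the zero starting point, and all names are assumptions made for this illustration; a flat prior on a log-variance coefficient corresponds to the usual noninformative density on the log scale.

```python
import numpy as np

def log_post(beta, phi, X, W, y):
    """Log posterior (up to a constant) for E(y) = X beta and
    log var(y) = W phi, assuming a flat prior on (beta, phi)."""
    log_var = W @ phi
    resid = y - X @ beta
    return -0.5 * np.sum(log_var + resid**2 / np.exp(log_var))

def metropolis(X, W, y, n_iter, step, rng):
    """Random-walk Metropolis on theta = (beta, phi)."""
    k, m = X.shape[1], W.shape[1]
    theta = np.zeros(k + m)
    lp = log_post(theta[:k], theta[k:], X, W, y)
    draws = np.empty((n_iter, k + m))
    for t in range(n_iter):
        prop = theta + step * rng.standard_normal(k + m)
        lp_prop = log_post(prop[:k], prop[k:], X, W, y)
        if np.log(rng.uniform()) < lp_prop - lp:     # accept with prob min(1, ratio)
            theta, lp = prop, lp_prop
        draws[t] = theta
    return draws

# Hypothetical data: two groups with different variances, indicator columns in W.
rng = np.random.default_rng(4)
g = np.repeat([0, 1], 50)
X = np.column_stack([np.ones(100), rng.standard_normal(100)])
W = np.column_stack([g == 0, g == 1]).astype(float)
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(100) * np.where(g == 0, 1.0, 2.0)
draws = metropolis(X, W, y, 5000, 0.05, rng)
print(draws[2500:].mean(axis=0))
```

The step size here is untuned; in a real analysis it would be adapted to give a reasonable acceptance rate, and convergence would be checked with multiple sequences.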


14.8 Including numerical prior information 


In some ways, prior information is already implicitly included in the classical regression; 
for example, we usually would not bother including a control variable if we thought it had 
no substantial predictive value. The meaning of the phrase ‘substantial predictive value’ 
depends on context, but can usually be made clear in applications; recall our discussion of 
the choice of variables to include in the incumbency advantage regressions. 

Here we show how to add conjugate prior information about regression parameters to the 
classical regression model. This is of interest as a Bayesian approach to classical regression 
models and, more importantly, because the same ideas return in the hierarchical normal 
linear model in Chapter 15. We express all results in terms of expanded linear regressions. 


Coding prior information on a regression parameter as an extra ‘data point’ 


First, consider adding prior information about a single regression coefficient, $\beta_j$. Suppose
we can express the prior information as a normal distribution:

$$\beta_j \sim \mathrm{N}(\beta_{j0}, \sigma_{\beta_j}^2),$$

with $\beta_{j0}$ and $\sigma_{\beta_j}^2$ known.

We can determine the posterior distribution by considering the prior information on $\beta_j$
to be another ‘data point’ in the regression. An observation in regression can be described
as a normal random variable $y$ with mean $x\beta$ and variance $\sigma^2$. The prior distribution for
$\beta_j$ can be seen to have the same form as a typical observation because the normal density
for $\beta_j$ is equivalent to a normal density for $\beta_{j0}$ with mean $\beta_j$ and variance $\sigma_{\beta_j}^2$:

$$p(\beta_j) \propto \frac{1}{\sigma_{\beta_j}}\exp\left(-\frac{(\beta_j - \beta_{j0})^2}{2\sigma_{\beta_j}^2}\right).$$

Thus considered as a function of $\beta_j$, the prior distribution can be viewed as an ‘observation’
$\beta_{j0}$ with corresponding ‘explanatory variables’ equal to zero, except $x_j$, which equals 1,
and a ‘variance’ of $\sigma_{\beta_j}^2$. To include the prior distribution in the regression, just append one
data point, $\beta_{j0}$, to $y$, one row of all zeros except for a 1 in the $j$th column to $X$, and a
diagonal element of $\sigma_{\beta_j}^2$ to the end of $\Sigma_y$ (with zeros on the appended row and column of
$\Sigma_y$ away from the diagonal). Then apply the computational methods for a noninformative
prior distribution: conditional on $\Sigma_y$, the posterior distribution for $\beta$ can be obtained by
weighted linear regression.

To understand this formulation, consider two extremes. In the limit of no prior
information, corresponding to $\sigma_{\beta_j}^2 \to \infty$, we are just adding a data point with infinite variance,
which has no effect on inference. In the limit of perfect prior information, corresponding
to $\sigma_{\beta_j}^2 = 0$, we are adding a data point with zero variance, which has the effect of fixing $\beta_j$
exactly at $\beta_{j0}$.
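The ‘extra data point’ device and its two limiting cases can be checked numerically. The sketch below assumes independent errors with known variances (so the data covariance matrix is stored as a vector of diagonal elements); the function names and the toy data are hypothetical.

```python
import numpy as np

def add_prior_point(X, y, var_y, j, beta_j0, var_beta_j):
    """Append the prior beta_j ~ N(beta_j0, var_beta_j) as one extra observation.
    var_y holds the diagonal of the data covariance matrix."""
    row = np.zeros(X.shape[1])
    row[j] = 1.0                                   # zeros except a 1 in column j
    return (np.vstack([X, row]),
            np.append(y, beta_j0),
            np.append(var_y, var_beta_j))

def posterior_mean(X, y, var_y):
    """Weighted least-squares estimate with known diagonal variances."""
    sw = 1.0 / np.sqrt(var_y)
    beta_hat, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta_hat

# Hypothetical intercept-only data with sample mean 3 and unit variances.
X, y, var_y = np.ones((4, 1)), np.array([1.0, 2.0, 3.0, 6.0]), np.ones(4)
for prior_var in [1e12, 0.25, 1e-12]:
    Xa, ya, va = add_prior_point(X, y, var_y, j=0, beta_j0=0.0, var_beta_j=prior_var)
    print(prior_var, posterior_mean(Xa, ya, va))
```

With a very large prior variance the estimate stays at the sample mean; with a very small prior variance it is pulled essentially all the way to the prior value; in between it is a precision-weighted average of the two.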


Interpreting prior information on several coefficients as several additional ‘data points’


Now consider prior information about the whole vector of parameters in $\beta$:

$$\beta \sim \mathrm{N}(\beta_0, \Sigma_\beta).$$

We can treat the prior distribution as $k$ prior ‘data points,’ and get correct posterior
inference by weighted linear regression applied to ‘observations’ $y_*$, ‘explanatory variables’ $X_*$,
and ‘variance matrix’ $\Sigma_*$, where

$$y_* = \begin{pmatrix} y \\ \beta_0 \end{pmatrix}, \quad X_* = \begin{pmatrix} X \\ I_k \end{pmatrix}, \quad \Sigma_* = \begin{pmatrix} \Sigma_y & 0 \\ 0 & \Sigma_\beta \end{pmatrix}. \tag{14.24}$$

If some of the components of $\beta$ have infinite variance (that is, noninformative prior
distributions), they should be excluded from these added ‘prior’ data points to avoid infinities in
the matrix $\Sigma_*$. Or, if we are careful, we can just work with $\Sigma_*^{-1}$ and its Cholesky factors and
never explicitly compute $\Sigma_*$. The joint prior distribution for $\beta$ is proper if all $k$ components
have proper prior distributions; that is, if $\Sigma_\beta^{-1}$ has rank $k$.

Computation conditional on $\Sigma_*$ is straightforward using the methods described in
Section 14.7 for regression with known covariance matrix. One can determine the marginal
posterior density of $\Sigma_*$ analytically and use the inverse-cdf method to draw simulations.
More complicated versions of the model, such as arise in hierarchical regression, can be
computed using the Gibbs sampler.
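The augmented regression can be checked against the familiar precision-weighted formula for a normal prior combined with a normal likelihood. The sketch below builds the augmented quantities of (14.24) and fits them with the known-covariance computation; function names and example values are hypothetical, and all prior variances are assumed finite.

```python
import numpy as np

def augment_with_prior(X, y, Sigma_y, beta_0, Sigma_beta):
    """Build the augmented observations, predictors, and variance matrix of (14.24)."""
    n, k = X.shape
    y_star = np.concatenate([y, beta_0])
    X_star = np.vstack([X, np.eye(k)])
    Sigma_star = np.block([[Sigma_y, np.zeros((n, k))],
                           [np.zeros((k, n)), Sigma_beta]])
    return y_star, X_star, Sigma_star

def gls(X, y, Sigma):
    """Posterior mean and variance of beta for a known covariance matrix."""
    L = np.linalg.cholesky(Sigma)
    X_w, y_w = np.linalg.solve(L, X), np.linalg.solve(L, y)
    V = np.linalg.inv(X_w.T @ X_w)
    return V @ (X_w.T @ y_w), V

# Hypothetical example.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 4.0])
Sigma_y = np.diag([1.0, 2.0, 4.0])
beta_0 = np.array([0.0, 1.0])
Sigma_beta = np.array([[1.0, 0.2], [0.2, 0.5]])
post_mean, post_var = gls(*[augment_with_prior(X, y, Sigma_y, beta_0, Sigma_beta)[i]
                            for i in (1, 0, 2)])
print(post_mean, post_var)
```

The posterior precision from the augmented regression equals the data precision plus the prior precision, which is the standard conjugate result.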


Prior information about variance parameters 


In general, prior information is less important for the parameters describing the variance
matrix than for the regression coefficients because $\sigma^2$ is generally of less substantive interest
than $\beta$. Nonetheless, for completeness, we show how to include such prior information.

For the normal linear model (weighted or unweighted), the conjugate prior distribution
for $\sigma^2$ is scaled inverse-$\chi^2$, which we will parameterize as

$$\sigma^2 \sim \text{Inv-}\chi^2(n_0, \sigma_0^2).$$

Using the same trick as above, this prior distribution is equivalent to $n_0$ prior data points
($n_0$ need not be an integer) with sample variance $\sigma_0^2$. The resulting posterior distribution
is also scaled inverse-$\chi^2$ and can be written as

$$\sigma^2|y \sim \text{Inv-}\chi^2\left(n_0 + n,\ \frac{n_0\sigma_0^2 + ns^2}{n_0 + n}\right),$$

in place of (14.6). If prior information about $\beta$ is also supplied, $s^2$ is replaced by the
corresponding value from the regression of $y_*$ on $X_*$ and $\Sigma_*$, and $n$ is replaced by the length
of $y_*$. In either case, we can directly draw from the posterior distribution for $\sigma^2$. In the
algebra and computations, one must replace $n$ by $n + n_0$ everywhere and add terms $n_0\sigma_0^2$ to
every estimate of $\sigma^2$. If there are several variance parameters, then we can use independent
conjugate prior distributions, or it may be better to model them hierarchically.
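A short sketch of the update for the variance follows, using the posterior formula as stated above. The function names and numerical values are hypothetical; draws from a scaled inverse-chi-squared distribution are obtained by dividing the degrees of freedom times the scale by a chi-squared draw.

```python
import numpy as np

def sigma2_posterior(n, s2, n0, sigma0_sq):
    """Degrees of freedom and scale of the scaled inverse-chi^2 posterior
    when the prior is Inv-chi^2(n0, sigma0_sq)."""
    df = n0 + n
    scale = (n0 * sigma0_sq + n * s2) / (n0 + n)
    return df, scale

def draw_scaled_inv_chi2(df, scale, size, rng):
    """If X ~ chi^2_df then df * scale / X ~ Inv-chi^2(df, scale)."""
    return df * scale / rng.chisquare(df, size=size)

# Hypothetical values: 10 data points with s^2 = 2, prior worth 5 points at 1.
df, scale = sigma2_posterior(n=10, s2=2.0, n0=5, sigma0_sq=1.0)
sigma2_draws = draw_scaled_inv_chi2(df, scale, 1000, np.random.default_rng(5))
print(df, scale, np.median(sigma2_draws))
```

The posterior scale is a weighted average of the prior guess and the data-based estimate, with weights given by the prior and actual sample sizes.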


Prior information in the form of inequality constraints on parameters 


Another form of prior information that is easy to incorporate in our simulation framework 
is inequality constraints, such as $\beta_1 \geq 0$, or $\beta_2 \leq \beta_3 \leq \beta_4$. Constraints such as positivity or
monotonicity occur in many problems. For example, recall the nonlinear bioassay example 
in Section 3.7 for which it might be sensible to constrain the slope to be nonnegative. 




Monotonicity can occur if the regression model includes an ordinal categorical variable that 
has been coded as indicator variables: we might wish to constrain the higher levels to have 
coefficients at least as large as the lower levels of the variable. 

A simple and often effective way to handle constraints in a simulation is just to
ignore them until the end: obtain simulations of $(\beta, \sigma)$ from the unconstrained distribution,
then simply discard any simulations that violate the constraints. Discarding is
reasonably efficient unless the constraints eliminate a large portion of the unconstrained posterior
distribution, in which case the data are tending to contradict the model.
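Discarding draws that violate the constraints takes only a boolean mask. The sketch below assumes a matrix of unconstrained posterior draws whose first four columns hold the first four coefficients, and applies a positivity constraint on the first and a monotonicity constraint on the next three; the function name and the tiny example are hypothetical.

```python
import numpy as np

def apply_constraints(beta_draws):
    """Keep draws with beta_1 >= 0 and beta_2 <= beta_3 <= beta_4
    (columns 0 to 3 hold beta_1 to beta_4). Also return the fraction kept."""
    b = beta_draws
    ok = (b[:, 0] >= 0) & (b[:, 1] <= b[:, 2]) & (b[:, 2] <= b[:, 3])
    return b[ok], ok.mean()

# Hypothetical unconstrained draws (one row per simulation).
draws = np.array([[ 1.0, 0.0, 1.0, 2.0],
                  [-1.0, 0.0, 1.0, 2.0],    # violates beta_1 >= 0
                  [ 1.0, 2.0, 1.0, 3.0],    # violates beta_2 <= beta_3
                  [ 0.0, 1.0, 1.0, 1.0]])
kept, fraction_kept = apply_constraints(draws)
print(kept, fraction_kept)
```

Reporting the fraction kept is useful in its own right: a very small fraction signals that the data are in tension with the constraints, as noted above.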


14.9 Bibliographic note 


Linear regression is described in detail in many textbooks, for example Weisberg (1985) 
and Neter et al. (1996). For other presentations of Bayesian linear regression, see Zellner 
(1971), Box and Tiao (1973), and, for a more informally Bayesian treatment, Gelman and 
Hill (2007). For analysis with a conjugate normal-inverse-$\chi^2$ prior that allows analytic
integration over $\beta$ and $\sigma^2$, see, for example, O’Hagan and Forster (2004). Fox (2002) presents
linear regression using the statistical package R. The computations of linear regression, 
including the QR decomposition and more complicated methods that are more efficient for 
large problems, are described in many places; for example, Gill, Murray, and Wright (1981) 
and Golub and van Loan (1983). Gelman, Goegebeur, et al. (2000), Pardoe (2001) and 
Pardoe and Cook (2002) discuss Bayesian graphical regression model checks. 

The incumbency in elections example comes from Gelman and King (1990a); more recent 
work in this area includes Cox and Katz (1996) and Ansolabehere and Snyder (2002). 
Gelman and Huang (2008) frame the problem using a hierarchical model. The general 
framework used for causal inference, also discussed in Chapter 8 of this book, is presented 
in Rubin (1974b, 1978a); for a related but slightly different perspective, see Pearl (2010). 
Bayesian approaches to analyzing regression residuals appear in Zellner (1975), Chaloner 
and Brant (1988), and Chaloner (1991). 

Lasso regression was presented by Tibshirani (1996) as a way to estimate a large number 
of parameters so that, with sparse data, many of the coefficients will be estimated at zero. 
Bayesian lasso was presented by Park and Casella (2008) using MCMC computation and by 
Seeger (2008) using expectation propagation. Carvalho, Polson, and Scott (2010) propose 
the ‘horseshoe’ distribution, a continuous Cauchy mixture model of normal distributions, as 
a hierarchical prior distribution for regression coefficients; that article also reviews several 
other families of prior distributions that have been proposed for this purpose. Polson and 
Scott (2012) discuss shrinkage priors more generally. Arminger (1998), Murray et al. (2013) 
and Hoff and Niu (2012) consider models for latent structured covariance matrices that can 
be viewed as dimension-reduction techniques. 

A variety of parametric models for unequal variances have been used in Bayesian analy- 
ses; Boscardin and Gelman (1996) present some references and an example with forecasting 
Presidential elections (see Section 15.2). 

Correlation models are important in many areas of statistics; see, for example, Box 
and Jenkins (1976), Brillinger (1981), and Pole, West, and Harrison (1994) for time series 
analysis, Kunsch (1987) and Cressie (1993) for spatial statistics, and Mugglin, Carlin, and 
Gelfand (2000) and Banerjee, Carlin, and Gelfand (2004) for space-time models. 


14.10 Exercises 


1. Analysis of radon measurements: 
(a) Fit a linear regression to the logarithms of the radon measurements in Table 7.3, with 
indicator variables for the three counties and for whether a measurement was recorded 
on the first floor. Summarize your posterior inferences in nontechnical terms. 
(b) Suppose another house is sampled at random from Blue Earth County. Sketch the 
posterior predictive distribution for its radon measurement and give a 95% predictive 
interval. Express the interval on the original (unlogged) scale. (Hint: you must 
consider the separate possibilities of basement or first-floor measurement.) 

2. Causal inference using regression: discuss the difference between finite-population and 
superpopulation inference for the incumbency advantage example of Section 14.3. 

3. Ordinary linear regression: derive the formulas for $\hat{\beta}$ and $V_\beta$ in (14.4)-(14.5) for the 
posterior distribution of the regression parameters. 

4. Ordinary linear regression: derive the formula for $s^2$ in (14.7) for the posterior distribution 
of the regression parameters. 

5. Analysis of the milk production data: consider how to analyze data from the cow exper- 
iment described in Section 8.4. Specifically: 

(a) Discuss the role of the treatment assignment mechanism for the appropriate analysis 
from a Bayesian perspective. 


(b) Discuss why you would focus on finite-population inferences for these 50 cows or on 
superpopulation inferences for the hypothetical population of cows from which these 
50 are conceptualized as a random sample. Either focus is legitimate, and a reasonable 
answer might be that one is easier than the other, but if this is your answer, say why 
it is true. 

6. Ordinary linear regression: derive the conditions that the posterior distribution is proper 
in Section 14.2. 

7. Posterior predictive distribution for ordinary linear regression: show that $p(\tilde{y}|\sigma, y)$ is a 
normal density. (Hint: first show that $p(\tilde{y}, \beta|\sigma, y)$ is the exponential of a quadratic form 
in $(\tilde{y}, \beta)$ and is thus a normal density.) 

8. Expression of prior information as additional data: give an algebraic proof of (14.24). 

9. Lasso regularization: 


(a) Write the (unnormalized) posterior density for regression coefficients with a lasso prior 
with parameter $\lambda$. Suppose the least-squares estimate is $\hat{\beta}$ with covariance matrix 
$V_\beta \sigma^2$. (For simplicity, treat the data variance $\sigma$ as known.) 
(b) Suppose $\beta$ is one-dimensional. Give the lasso (posterior mode) estimate of $\beta$. 
(c) Suppose $\beta$ is multidimensional. Explain why, in general, the lasso estimate cannot 
simply be found by pulling the least-squares estimate of each coefficient toward zero. 
10. Lasso regularization: Find a linear regression example, in an application area of interest 
to you, with many predictors. 


(a) Do a lasso regression, that is, a Bayesian regression using a double-exponential prior 
distribution with hyperparameter $\lambda$ estimated from data, summarizing by the posterior 
mode of the coefficients. 


(b) Get uncertainty in this estimate by bootstrapping the data 100 times and repeating 
the above step for each bootstrap sample. 


(c) Now fit a fully Bayesian lasso (that is, the same model as in (a) but assigning a 
hyperprior to $\lambda$ and obtaining simulations over the entire posterior distribution). 


(d) Compare your inferences in (a), (b), and (c). 

11. Straight-line fitting with variation in x and y: suppose we wish to model two variables, 
x and y, as having an underlying linear relation with added errors. That is, with data 
$(x, y)_i$, $i = 1,\ldots,n$, we model $(x_i, y_i)^T \sim \mathrm{N}((u_i, v_i)^T, \Sigma)$, and $v_i = a + b u_i$. The goal is to 
estimate the underlying regression parameters, $(a, b)$. 

(a) Assume that the values $u_i$ follow a normal distribution with mean $\mu$ and variance $\tau^2$. 
Write the likelihood of the data given the parameters; you can do this by integrating 
over $u_1,\ldots,u_n$ or by working with the multivariate normal distribution. 

(b) Discuss reasonable noninformative prior distributions on $(a, b)$. 

See Snedecor and Cochran (1989, p. 173) for an approximate solution, and Gull (1989b) 
for a Bayesian treatment of the problem of fitting a line with errors in both variables. 

Body mass (kg)   Body surface (cm²)   Metabolic rate (kcal/day) 
31.2             10750                1113 
24.0             8805                 982 
19.8             7500                 908 
18.2             7662                 842 
9.6              5286                 626 
6.5              3724                 430 
3.2              2423                 281 

Table 14.3 Data from the earliest study of metabolic rate and body surface area, measured on a set 
of dogs. From Schmidt-Nielsen (1984, p. 78). 

12. Straight-line fitting with variation in x and y: you will use the model developed in the 
previous exercise to analyze the data on body mass and metabolic rate of dogs in Table 
14.3, assuming an approximate linear relation on the logarithmic scale. In this case, 
the errors in x are presumably caused primarily by failures in the model and variation 
among dogs rather than ‘measurement error.’ 

(a) Assume that log body mass and log metabolic rate have independent ‘errors’ of equal 
variance, $\sigma^2$. Assuming a noninformative prior distribution, compute posterior simulations 
of the parameters. 

(b) Summarize the posterior inference for b and explain the meaning of the result on the 
original, untransformed scale. 

(c) How does your inference for b change if you assume a variance ratio of 2? 

13. Straight-line fitting with variation in $x_1$, $x_2$, and y: adapt the model used in the previous 
exercise to the problem of estimating an underlying linear relation with two predictors. 
Estimate the relation of log metabolic rate to log body mass and log body surface area 
using the data in Table 14.3. How does the near-collinearity of the two predictor variables 
affect your inference? 

14. Heaped data and regression: As part of a public health study, the ages and heights were 
recorded for several hundred young children in Africa. The goal was to see how many 
children were too short, given their age. File reported_ages.png shows a histogram of 
the reported ages (in months). The spikes at 12 months, etc., suggest that some (but 
not all) of the ages are rounded (it is not obvious whether they are being rounded up or 
down). You can assume that some ages are reported exactly, some are rounded to the 
nearest 6 months, and some are rounded to the nearest 12 months. 


(a) Set up a model for these data. You can use the notation $a_i$ for the reported ages and 
$h_i$ for the heights (which are measured, essentially exactly). 
(b) Write the likelihood for your model. 


(c) Describe how you would estimate the parameters in your model. 

Chapter 15 


Hierarchical linear models 


Hierarchical regression models are useful as soon as there are predictors at different levels 
of variation. For example, in studying scholastic achievement we may have information 
about individual students (for example, family background), class-level information (char- 
acteristics of the teacher), and also information about the school (educational policy, type 
of neighborhood). Another situation in which hierarchical modeling arises naturally is in 
the analysis of data obtained by stratified or cluster sampling. A natural family of models 
is regression of y on indicator variables for strata or clusters, in addition to any measured 
predictors x. With cluster sampling, hierarchical modeling is in fact necessary in order to 
generalize to the unsampled clusters. 

With predictors at multiple levels, the assumption of exchangeability of units or subjects 
at the lowest level breaks down. The simplest extension from classical regression is to 
introduce as predictors a set of indicator variables for each of the higher-level units in 
the data—that is, for the classes in the educational example or for the strata or clusters 
in the sampling example. But this will in general dramatically increase the number of 
parameters in the model, and sensible estimation of these is only possible through further 
modeling, in the form of a population distribution. The latter may itself take a simple 
exchangeable or independent and identically distributed form, but it may also be reasonable 
to consider a further regression model at this second level, to allow for the predictors defined 
at this level. In principle there is no limit to the number of levels of variation that can be 
handled in this way. Bayesian methods provide ready guidance on handling the estimation of 
unknown parameters, although computational complexities can be considerable, especially 
if one moves out of the realm of conjugate normal specifications. In this chapter we give a 
brief introduction to the broad topic of hierarchical linear models, emphasizing the general 
principles used in handling normal models. 

In fact, we have already considered a hierarchical linear model in Chapter 5: the problem 
of estimating several normal means can be considered as a special case of linear regression. 
In the notation of Section 5.5, the data points are $y_j$, $j = 1,\ldots,J$, and the regression 
coefficients are the school parameters $\theta_j$. In this example, therefore, n = J; the number of 
‘data points’ equals the number of explanatory variables. The X matrix is just the J × J 
identity matrix, and the individual observations have known variances $\sigma_j^2$. Section 5.5 
discussed the flaws of no pooling and complete pooling of the data, $y_1,\ldots,y_J$, to estimate 
the parameters, $\theta_j$. In the regression context, no pooling corresponds to a noninformative 
uniform prior distribution on the regression coefficients, and complete pooling corresponds 
to the J coefficients having a common prior distribution with zero variance. The favored 
hierarchical model corresponds to a prior distribution of the form $\beta \sim \mathrm{N}(\mu, \tau^2 I)$. 
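
To make the three cases concrete, the following is a minimal sketch, assuming the hyperparameters $\mu$ and $\tau$ are treated as known (in a full analysis they would be averaged over). It computes the conditional posterior means of the $\theta_j$ as precision-weighted averages of the data and the prior mean. The function name and the numbers in the example are ours and purely illustrative; they are not data from the book.

```python
import numpy as np

def conditional_posterior_means(y, sigma, mu, tau):
    """Posterior means of theta_j given (mu, tau) in the normal hierarchical model.

    y, sigma: group estimates and their known standard errors.
    tau = 0 gives complete pooling; tau = np.inf gives no pooling.
    """
    y = np.asarray(y, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    if tau == 0:
        return np.full_like(y, mu)      # all coefficients equal the common mean
    if np.isinf(tau):
        return y.copy()                 # flat prior: each estimate stands alone
    data_precision = 1.0 / sigma**2
    prior_precision = 1.0 / tau**2
    # Precision-weighted average of the data point and the prior mean.
    return (data_precision * y + prior_precision * mu) / (data_precision + prior_precision)

# Hypothetical inputs, chosen only to show the three regimes.
y = [10.0, 0.0]
sigma = [1.0, 1.0]
for tau in (0.0, 1.0, np.inf):
    print(tau, conditional_posterior_means(y, sigma, mu=5.0, tau=tau))
```

Intermediate values of $\tau$ pull each estimate part of the way toward $\mu$, which is the compromise the hierarchical model formalizes.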

In the next section, we present notation and computations for the simple varying- 
coefficients model, which constitutes the simplest version of the general hierarchical linear 
model (of which the eight schools example of Section 5.5 is in turn a simple case). We illus- 
trate in Section 15.2 with the example of forecasting U.S. presidential elections, and then 
go on to the general form of the hierarchical linear model in Section 15.3. Throughout, we 
assume a normal linear regression model for the likelihood, $y \sim \mathrm{N}(X\beta, \Sigma_y)$, as in Chapter 
14, and we label the regression coefficients as $\beta_j$, $j = 1,\ldots,J$. 


15.1 Regression coefficients exchangeable in batches 


We begin by considering hierarchical regression models in which groups of the regression 
coefficients are exchangeable and are modeled with normal population distributions. Each 
such group is called a batch of random effects or varying coefficients. 


Simple varying-coefficients model 


In the simplest form of the random-effects or varying-coefficients model, all of the regression 
coefficients contained in the vector $\beta$ are exchangeable, and their population distribution 
can be expressed as 

$$\beta \sim \mathrm{N}(1\alpha, \sigma_\beta^2 I), \qquad (15.1)$$

where $\alpha$ and $\sigma_\beta$ are unknown scalar parameters, and 1 is the J × 1 vector of ones, $1 = 
(1,\ldots,1)^T$. We use this vector-matrix notation to allow for easy generalization to regression 
models for the coefficients $\beta$, as we discuss in Section 15.3. Model (15.1) is equivalent to 
the hierarchical model we applied to the educational testing example of Section 5.5, using 
$(\beta, \alpha, \sigma_\beta)$ in place of $(\theta, \mu, \tau)$. As in that example, this general model includes, as special 
cases, unrelated $\beta_j$’s ($\sigma_\beta = \infty$) and all $\beta_j$’s equal ($\sigma_\beta = 0$). 

It can be reasonable to start with a prior density that is uniform on $\alpha, \sigma_\beta$, as we used in 
the educational testing example. As discussed in Section 5.4, we cannot assign a uniform 
prior distribution to $\log\sigma_\beta$ (the standard ‘noninformative’ prior distribution for variance 
parameters), because this leads to an improper posterior distribution with all its mass in 
the neighborhood of $\sigma_\beta = 0$. Another relatively noninformative prior distribution that is 
often used for $\sigma_\beta^2$ is scaled inverse-$\chi^2$ (see Appendix A) with the degrees of freedom set 
to a low number such as 2. In applications one should be careful to ensure that posterior 
inferences are not sensitive to these choices; if they are, then greater care needs to be taken 
in specifying prior distributions that are defensible on substantive grounds. If there is 
little replication in the data at the level of variation corresponding to a particular variance 
parameter, then that parameter is generally not well estimated by the data and inferences 
may be sensitive to prior assumptions. 


Intraclass correlation 


There is a straightforward connection between the varying-coefficients model just described 
and a within-group correlation. Suppose data $y_1,\ldots,y_n$ fall into J batches and have a multivariate 
normal distribution: $y \sim \mathrm{N}(\alpha 1, \Sigma_y)$, with $\mathrm{var}(y_i) = \eta^2$ for all i, and $\mathrm{cov}(y_{i_1}, y_{i_2}) = \rho\eta^2$ 
if $i_1$ and $i_2$ are in the same batch and 0 otherwise. (We use the notation 1 for the n × 1 
vector of 1’s.) If $\rho > 0$, this is equivalent to the model $y \sim \mathrm{N}(X\beta, \sigma^2 I)$, where X is an n × J 
matrix of indicator variables with $X_{ij} = 1$ if unit i is in batch j and 0 otherwise, and $\beta$ 
has the varying-coefficients population distribution (15.1). The equivalence of the models 
occurs when $\eta^2 = \sigma^2 + \sigma_\beta^2$ and $\rho = \sigma_\beta^2/(\sigma^2 + \sigma_\beta^2)$, as can be seen by deriving the marginal 
distribution of y, averaging over $\beta$. More generally, positive intraclass correlation in a linear 
regression can be subsumed into a varying-coefficients model by augmenting the regression 
with J indicator variables whose coefficients have the population distribution (15.1). 
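
The equivalence can be checked numerically. The sketch below, which is ours rather than part of the original development, builds the indicator matrix X for a small hypothetical batch assignment and forms the marginal covariance of y after averaging over $\beta$, namely $\sigma^2 I + \sigma_\beta^2 XX^T$. The batch labels and the two standard deviations are arbitrary illustrative values.

```python
import numpy as np

def marginal_covariance(batch, sigma, sigma_beta):
    """Covariance of y after averaging over beta ~ N(alpha * 1, sigma_beta^2 I).

    batch: integer array of length n giving each unit's batch index in 0..J-1.
    """
    batch = np.asarray(batch)
    n, J = batch.size, batch.max() + 1
    X = np.zeros((n, J))
    X[np.arange(n), batch] = 1.0            # X_ij = 1 if unit i is in batch j
    return sigma**2 * np.eye(n) + sigma_beta**2 * X @ X.T

# Hypothetical setup: five units in two batches.
batch = np.array([0, 0, 1, 1, 1])
sigma, sigma_beta = 2.0, 1.5
cov = marginal_covariance(batch, sigma, sigma_beta)

eta2 = sigma**2 + sigma_beta**2             # common marginal variance
rho = sigma_beta**2 / eta2                  # intraclass correlation
print(cov)
print(eta2, rho)
```

The diagonal entries equal $\eta^2$, entries for pairs in the same batch equal $\rho\eta^2$, and entries for pairs in different batches are zero, matching the covariance structure stated above.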


Mixed-effects model 


An important variation on the simple varying-coefficients or random-effects model is the 
‘mixed-effects model,’ in which the first $J_1$ components of $\beta$ are assigned independent improper 
prior distributions, and the remaining $J_2 = J - J_1$ components are exchangeable 
with common mean $\alpha$ and standard deviation $\sigma_\beta$. The first $J_1$ components, which are 
implicitly modeled as exchangeable with infinite prior variance, are sometimes called fixed 
effects.¹ 

A simple example is the hierarchical normal model considered in Chapter 5; the varying- 
coefficients model with the school means normally distributed and a uniform prior density 
assumed for their mean $\alpha$ is equivalent to what is sometimes called a mixed-effects model 
with a single constant ‘fixed effect’ and a set of random effects with mean 0. 


Several sets of varying coefficients 


To generalize, allow the J components of $\beta$ to be divided into K clusters of coefficients, 
with cluster k having population mean $\alpha_k$ and standard deviation $\sigma_{\beta k}$. A mixed-effects 
model is obtained by setting the variance to $\infty$ for one of the clusters of coefficients. We 
return to these models in discussing the analysis of variance in Section 15.7. 


Exchangeability 


The essential feature of varying-coefficient models is that exchangeability of the units of 
analysis is achieved by conditioning on indicator variables that represent groupings in the 
population. The varying coefficients allow each subgroup to have a different mean outcome 
level, and averaging over these parameters to a marginal distribution for y induces a cor- 
relation between outcomes observed on units in the same subgroup (just as in the simple 
intraclass correlation model described above). 


15.2 Example: forecasting U.S. presidential elections 


We illustrate hierarchical linear modeling with an example in which a hierarchical model 
is useful for obtaining realistic forecasts. Following standard practice, we begin by fitting 
a nonhierarchical linear regression with a noninformative prior distribution but find that 
the simple model does not provide an adequate fit. Accordingly we expand the model 
hierarchically, including varying coefficients to model variation at a second level in the 
data. 

Political scientists in the U.S. have been interested in the idea that national elections 
are highly predictable, in the sense that one can accurately forecast election results using 
information publicly available several months before the election. In recent years, several 
different linear regression forecasts have been suggested for U.S. presidential elections. In 
this chapter, we present a hierarchical linear model that was estimated from the elections 
through 1988 and used to forecast the 1992 election. 


Unit of analysis and outcome variable 


The units of analysis are results in each state from each of the 11 presidential elections 
from 1948 through 1988. The outcome variable of the regression is the Democratic party 
candidate’s share of the two-party vote for president in that state and year. For convenience 
and to avoid tangential issues, we discard the District of Columbia (in which the Democrats 
have received over 80% in every presidential election) and states with third-party victories 
from our model, leaving us with 511 units from the 11 elections considered. 


¹The terms ‘fixed’ and ‘random’ come from the non-Bayesian statistical tradition and are somewhat 
confusing in a Bayesian context where all unknown parameters are treated as ‘random’ or, equivalently, as 
having fixed but unknown values. 

[Figure 15.1: two scatterplots of state vote shares. Recovered axis labels: ‘Dem vote by state (1984)’ for panel (a) and ‘Dem vote by state (1972)’ for panel (b).] 

Figure 15.1 (a) Democratic share of the two-party vote for president, for each state, in 1984 and 
1988. (b) Democratic share of the two-party vote for president, for each state, in 1972 and 1976. 


Preliminary graphical analysis 


Figure 15.1a suggests that the presidential vote may be strongly predictable from one elec- 
tion to the next. The fifty points on the figure represent the states of the U.S. (indicated 
by their two-letter abbreviations); the x and y coordinates of each point show the Demo- 
cratic party’s share of the vote in the presidential elections of 1984 and 1988, respectively. 
The points fall close to a straight line, indicating that a linear model predicting y from 
x is reasonable and relatively precise. The pattern is not always so strong, however; con- 
sider Figure 15.1b, which displays the votes by states in 1972 and 1976—the relation is 
not close to linear. Nevertheless, a careful look at the second graph reveals some patterns: 
the greatest outlying point, on the upper left, is Georgia (‘GA’), the home state of Jimmy 
Carter, the Democratic candidate in 1976. The other outlying points, all on the upper left 
side of the 45° line, are other states in the South, Carter’s home region. It appears that 
it may be possible to create a good linear fit by including other predictors in addition to 
the Democratic share of the vote in the previous election, such as indicator variables for 
the candidates’ home states and home regions. (For political analysis, the United States is 
typically divided into four regions: Northeast, South, Midwest, and West, with each region 
containing ten or more states.) 


Fitting a preliminary, nonhierarchical, regression model 


Political trends such as partisan shifts can occur nationwide, at the level of regions of the 
country, or in individual states; to capture these three levels of variation, we include three 
kinds of explanatory variables in the regression. The nationwide variables—which are the 
same for every state in a given election year—include national measures of the popularity of 
the candidates, the popularity of the incumbent President (who may or may not be running 
for reelection), and measures of the condition of the economy in the past two years. Regional 
variables include home-region indicators for the candidates and various adjustments for 
past elections in which regional voting had been important. Statewide variables include 
the Democratic party’s share of the state’s vote in recent presidential elections, measures 
of the state’s economy and political ideology, and home-state indicators. The explanatory 
variables used in the model are listed in Table 15.1. With 511 observations, a large number 
of state and regional variables can reasonably be fitted in a model of election outcome, 
assuming there are smooth patterns of dependence on these covariates across states and 
regions. Fewer relationships with national variables can be estimated, however, since for 
this purpose there are essentially only 11 data points: the national elections. 

                                                        Sample quantiles 
Description of variable                                 min      median   max 
Nationwide variables: 
  Support for Dem. candidate in Sept. poll              0.37     0.46     0.69 
  (Presidential approval in July poll) × Inc            -0.69    -0.47    0.74 
  (Presidential approval in July poll) × Presinc        -0.69    0        0.74 
  (2nd quarter GNP growth) × Inc                        -0.024   -0.005   0.018 
Statewide variables: 
  Dem. share of state vote in last election             -0.23    -0.02    0.41 
  Dem. share of state vote two elections ago            -0.48    -0.02    0.41 
  Home states of presidential candidates                -1       0        1 
  Home states of vice-presidential candidates           -1       0        1 
  Democratic majority in the state legislature          -0.49    0.07     0.50 
  (State economic growth in past year) × Inc            -0.22    -0.00    0.26 
  Measure of state ideology                             -0.78    -0.02    0.69 
  Ideological compatibility with candidates             -0.32    -0.05    0.32 
  Proportion Catholic in 1960 (compared to U.S. avg.)   -0.21    0        0.38 
Regional/subregional variables: 
  South                                                 0        0        1 
  (South in 1964) × (-1)                                -1       0        0 
  (Deep South in 1964) × (-1)                           -1       0        0 
  New England in 1964                                   0        0        1 
  North Central in 1972                                 0        0        1 
  (West in 1976) × (-1)                                 -1       0        0 

Table 15.1 Variables used for forecasting U.S. presidential elections. Sample minima, medians, and 
maxima come from the 511 data points. All variables are signed so that an increase in a variable 
would be expected to increase the Democratic share of the vote in a state. ‘Inc’ is defined to be 
+1 or -1 depending on whether the incumbent President is a Democrat or a Republican. ‘Presinc’ 
equals Inc if the incumbent President is running for reelection and 0 otherwise. ‘Dem. share of state 
vote’ in last election and two elections ago are coded as deviations from the corresponding national 
votes, to allow for a better approximation to prior independence among the regression coefficients. 
‘Proportion Catholic’ is the deviation from the average proportion in 1960, the only year in which 
a Catholic ran for President. See Gelman and King (1993) and Boscardin and Gelman (1996) for 
details on the other variables, including a discussion of the regional/subregional variables. When 
fitting the hierarchical model, we also included indicators for years and regions within years. 


For a first analysis, we fit a classical regression including all the variables in Table 15.1 
to the data up to 1988, as described in Chapter 14. We could then draw simulations from 
the posterior distribution of the regression parameters and use each of these simulations, 
applied to the national, regional, and state explanatory variables for 1992, to create a 
random simulation of the vector of election outcomes for the fifty states in 1992. These 
simulated results could be used to estimate the probability that each candidate would win 
the national election and each state election, the expected number of states each candidate 
would win, and other predictive quantities. 


Checking the preliminary regression model 


The ordinary linear regression model ignores the year-by-year structure of the data, treating 
them as 511 independent observations, rather than 11 sets of roughly 50 related observations 
each. Substantively, the feature of these data that such a model misses is that partisan 
support across the states does not vary independently: if, for example, the Democratic 
candidate for President receives a higher-than-expected vote share in Massachusetts in a 

Figure 15.2 Scatterplot showing the joint distribution of simulation draws of the realized test quantity, 
$T(y, \beta)$ (the square root of the average of the 11 squared nationwide residuals), and its hypothetical 
replication, $T(y^{\rm rep}, \beta)$, under the nonhierarchical model for the election forecasting example. 
The 200 simulated points are far below the 45° line, which means that the realized test quantity is 
much higher than predicted under the model. 


particular year, we would expect him also to perform better than expected in Utah in 
that year. In other words, because of the known grouping into years, the assumption of 
exchangeability among the 511 observations does not make sense, even after controlling for 
the explanatory variables. 

An important use of the model is to forecast the nationwide outcome of the presidential 
election. One way of assessing the significance of possible failures in the model is to use the 
model-checking approach of Chapter 6. To check whether correlation of the observations 
from the same election has a substantial effect on nationwide forecasts, we create a test 
variable that reflects the average precision of the model in predicting the national result— 
the square root of the average of the squared nationwide realized residuals for the 11 general 
elections in the dataset. (Each nationwide realized residual is the average of $(y_i - X_i\beta)$ for 
the roughly 50 observations in that election year.) This test variable should be sensitive 
to positive correlations of outcomes in each year. We then compare the values of the test 
variable $T(y, \beta)$ from the posterior simulations of $\beta$ to the hypothetical replicated values 
under the model, $T(y^{\rm rep}, \beta)$. The results are displayed in Figure 15.2. As can be seen in the 
figure, the observed variation in national election results is larger than would be expected 
from the model. The practical consequence of the failure of the model is that its forecasts 
of national election results are falsely precise. 


Extending to a varying-coefficients model 


We can improve the regression model by adding an additional predictor for each year to 
serve as an indicator for nationwide partisan shifts unaccounted for by the other national 
variables; this adds 11 new components of $\beta$ corresponding to the 11 election years in the 
data. The additional columns in X are indicator vectors of zeros and ones indicating which 
observations correspond to which year. After controlling for the national variables already 
in the model, we fit an exchangeable model for the election-year variables, which in the 
normal model implies a common mean and variance. Since a constant term is already in 
the model, we can set the mean of the population distribution of the year-level errors to 
zero (recall Section 15.1). By comparison, the classical regression model we fitted earlier is 
a special case of the current model in which the variance of the 11 election-year coefficients 
is fixed at zero. 

In addition to year-to-year variability not captured by the model, there are also electoral 
swings that follow the region of the country: Northeast, South, Midwest, and West. To 
capture regional variability, we include 44 region × year indicator variables (also with 
the mean of the population distributions set to zero) to cover all regions in all election 
years. Within each region, we model these indicator variables as exchangeable. Because 
the South tends to act as a special region of the U.S. politically, we give the 11 Southern 
regional variables their own common variance, and treat the remaining 33 regional variables 
as exchangeable with their own variance. In total, we have added 55 new $\beta$ parameters 
and three new variance components to the model, and we have excluded the regional and 
subregional corrections in Table 15.1 associated with specific years. We can write the model 
for data in states s, regions r(s), and years t, 


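$$
\begin{aligned}
y_{st} &\sim \mathrm{N}(X_{st}\beta + \gamma_{r(s)t} + \delta_t,\ \sigma^2) \\
\gamma_{rt} &\sim \begin{cases} \mathrm{N}(0, \tau_{\gamma 1}^2) & \text{for } r = 1, 2, 3 \text{ (non-South)} \\ \mathrm{N}(0, \tau_{\gamma 2}^2) & \text{for } r = 4 \text{ (South)} \end{cases} \\
\delta_t &\sim \mathrm{N}(0, \tau_\delta^2),
\end{aligned}
\qquad (15.2)
$$

with a uniform hyperprior distribution on $\beta, \sigma, \tau_{\gamma 1}, \tau_{\gamma 2}, \tau_\delta$. (We also fit with a uniform 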
prior distribution on the hierarchical variances, rather than the standard deviations, and 
obtained essentially identical inferences.) 

In the notation of Section 15.1, we would label $\beta$ as the concatenated vector of all the 
varying coefficients, $(\beta, \gamma, \delta)$ in formulation (15.2). The augmented $\beta$ has a prior with mean 
0 and diagonal variance matrix $\Sigma_\beta = \mathrm{diag}(\infty,\ldots,\infty, \tau_{\gamma 1}^2,\ldots,\tau_{\gamma 1}^2, \tau_{\gamma 2}^2,\ldots,\tau_{\gamma 2}^2, \tau_\delta^2,\ldots,\tau_\delta^2)$. 
The first 14 elements of $\beta$, corresponding to the constant term and the predictors in Table 
15.1, have noninformative priors; that is, $\sigma_{\beta j} = \infty$ for these elements. The next 33 values 
of $\sigma_{\beta j}$ are set to the parameter $\tau_{\gamma 1}$. The 11 elements corresponding to the Southern regional 
variables have $\sigma_{\beta j} = \tau_{\gamma 2}$. The final 11 elements correspond to the nationwide shifts and 
have prior standard deviation $\tau_\delta$. 
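
As a small illustration of this bookkeeping, the sketch below assembles the vector of prior standard deviations $\sigma_{\beta j}$ and the corresponding diagonal prior precision matrix. The block sizes (14, 33, 11, 11) are taken from the paragraph above; the function name and the three numerical values passed to it are hypothetical.

```python
import numpy as np

def prior_sd_vector(tau_gamma1, tau_gamma2, tau_delta,
                    n_fixed=14, n_nonsouth=33, n_south=11, n_year=11):
    """Prior standard deviations for the augmented coefficient vector."""
    return np.concatenate([
        np.full(n_fixed, np.inf),          # noninformative: infinite prior sd
        np.full(n_nonsouth, tau_gamma1),   # non-South region x year effects
        np.full(n_south, tau_gamma2),      # South x year effects
        np.full(n_year, tau_delta),        # nationwide year effects
    ])

# Hypothetical values for the three hierarchical standard deviations.
sd = prior_sd_vector(tau_gamma1=0.02, tau_gamma2=0.04, tau_delta=0.03)
prior_precision = np.diag(1.0 / sd**2)     # infinite variance maps to zero precision
print(sd.size, prior_precision[0, 0], prior_precision[-1, -1])
```

Working with precisions rather than variances keeps the infinite-variance elements finite in computation: they simply contribute zero prior information.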


Forecasting 


Predictive inference is more subtle for a hierarchical model than a classical regression model, 
because of the possibility that new parameters (varying coefficients) $\beta$ must be estimated for 
the predictive data. Consider the task of forecasting the outcome of the 1992 presidential 
election, given the 50 × 20 matrix of explanatory variables for the linear regression corresponding 
to the 50 states in 1992. To form the complete matrix of explanatory variables for 
1992, $\tilde{X}$, we must include 55 columns of zeros, thereby setting the indicators for previous 
years to zero for estimating the results in 1992. Even then, we are not quite ready to make 
predictions. To simulate draws from the predictive distribution of 1992 state election results 
using the hierarchical model, we must include another year indicator and four new region 
× year indicator variables for 1992; this adds five new predictors. However, we have no 
information on the coefficients of these predictor variables; they are not even included in the 
vector $\beta$ that we have estimated from the data up to 1988. Since we have no data on these 
five new components of $\beta$, we must simulate their values from their posterior (predictive) 
distribution; that is, the coefficient for the year indicator is drawn as $\mathrm{N}(0, \tau_\delta^2)$, the non-South 
region × year coefficients are drawn as $\mathrm{N}(0, \tau_{\gamma 1}^2)$, and the South × year coefficient is 
drawn as $\mathrm{N}(0, \tau_{\gamma 2}^2)$, using the values $\tau_\delta, \tau_{\gamma 1}, \tau_{\gamma 2}$ drawn from the posterior simulation. 
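
The simulation step for the five new coefficients can be sketched as follows. This is a minimal illustration, assuming arrays of posterior draws are already available; here they are replaced by made-up placeholder values, and the function names, the region coding (0, 1, 2 for non-South, 3 for South), and the small design matrix are all hypothetical.

```python
import numpy as np

def draw_new_effects(tau_delta, tau_gamma1, tau_gamma2, rng):
    """For each posterior draw, simulate the new year effect and four region effects.

    Each argument is a 1-D array of posterior draws of a hierarchical sd.
    Returns delta_new with shape (S,) and gamma_new with shape (S, 4),
    with regions ordered as three non-South regions followed by the South.
    """
    S = tau_delta.size
    delta_new = rng.normal(0.0, tau_delta)
    gamma_nonsouth = rng.normal(0.0, tau_gamma1[:, None], size=(S, 3))
    gamma_south = rng.normal(0.0, tau_gamma2)
    return delta_new, np.column_stack([gamma_nonsouth, gamma_south])

def simulate_forecast(X_new, region, beta, sigma, delta_new, gamma_new, rng):
    """One predictive draw of every state's outcome per posterior draw."""
    mean = beta @ X_new.T + gamma_new[:, region] + delta_new[:, None]
    return mean + sigma[:, None] * rng.standard_normal(mean.shape)

# Placeholder 'posterior draws' and design matrix, for illustration only.
rng = np.random.default_rng(1)
S, n_states, k = 1000, 50, 3
beta = rng.normal([0.5, 0.1, 0.05], 0.01, size=(S, k))
sigma = np.full(S, 0.03)
tau_delta, tau_gamma1, tau_gamma2 = (np.full(S, v) for v in (0.03, 0.02, 0.04))
X_new = np.column_stack([np.ones(n_states), rng.normal(size=(n_states, k - 1))])
region = rng.integers(0, 4, size=n_states)

delta_new, gamma_new = draw_new_effects(tau_delta, tau_gamma1, tau_gamma2, rng)
y_new = simulate_forecast(X_new, region, beta, sigma, delta_new, gamma_new, rng)
print(y_new.shape)
```

Each row of `y_new` is one simulated set of state outcomes, so summaries such as the proportion of rows in which a candidate carries a given state follow directly. Note that the year effect is shared across all states within a row, which is what produces the correlated forecast errors.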


Posterior inference 


We fit the model using EM and the vector Gibbs sampler as described in Section 15.5, 
to obtain a set of draws from the posterior distribution of $(\beta, \sigma, \tau_\delta, \tau_{\gamma 1}, \tau_{\gamma 2})$. We ran ten 
parallel Gibbs sampler sequences; after 500 steps, the potential scale reductions, $\hat{R}$, were 
below 1.1 for all parameters. 

Figure 15.3 Scatterplot showing the joint distribution of simulation draws of the realized test quantity, 
$T(y, \beta)$ (the square root of the average of the 11 squared nationwide residuals), and its hypothetical 
replication, $T(y^{\rm rep}, \beta)$, under the hierarchical model for the election forecasting example. 
The 200 simulated points are scattered evenly about the 45° line, which means that the model 
accurately fits this particular test quantity. 

The coefficient estimates for the variables in Table 15.1 are similar to the results from the 
preliminary, nonhierarchical regression. The posterior medians of the coefficients all have the 
expected positive sign. The hierarchical standard deviations $\tau_\delta$, $\tau_{\gamma 1}$, $\tau_{\gamma 2}$ are not determined
with great precision. This points out one advantage of the full Bayesian approach; if we had 
simply made point estimates of these variance components, we would have been ignoring a 
wide range of possible values for all the parameters. 

When applied to data from 1992, the model yields state-by-state predictions that are 
summarized in Figure 6.1, with a forecasted 85% probability that the Democrats would 
win the national electoral vote total. The forecasts for individual states have predictive 
standard errors between 5% and 6%. 

We tested the model in the same way as we tested the nonhierarchical model, by a 
posterior predictive check on the average of the squared nationwide residuals. The simula- 
tions from the hierarchical model, with their additional national and regional error terms, 
accurately fit the observed data, as shown in Figure 15.3. 
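The check can be summarized numerically as well as graphically. The sketch below assumes we already have, for each posterior simulation draw, the realized and replicated values of the test quantity; the function names and the toy numbers are ours, for illustration only.

```python
import math

def rms(residuals):
    """Test quantity: square root of the average of the squared residuals."""
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

def posterior_predictive_pvalue(t_realized, t_replicated):
    """Share of simulation draws in which T(y_rep, beta) >= T(y, beta)."""
    pairs = list(zip(t_realized, t_replicated))
    return sum(t_rep >= t_obs for t_obs, t_rep in pairs) / len(pairs)

# Toy values, one pair per simulation draw (hypothetical, not from the election example).
p_value = posterior_predictive_pvalue([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0])
```

A proportion near 0.5 corresponds to points scattered evenly about the 45° line; a proportion near 0 or 1 would signal that the model fails to reproduce this aspect of the data.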


Reasons for using a hierarchical model 


In summary, there are three main advantages of the hierarchical model here: 
• It allows the modeling of correlation within election years and regions.

• Including the year and region × year terms without a hierarchical model, or not including
these terms at all, correspond to special cases of the hierarchical model with $\tau = \infty$ or 0,
respectively. The more general model allows for a reasonable compromise between these
extremes.

• Predictions will have additional components of variability for regions and year and should
therefore be more reliable.


15.3 Interpreting a normal prior distribution as extra data 


More general forms of the hierarchical linear model can be created, with further levels of 
parameters representing additional structure in the problem. For instance, building on the 
brief example in the opening paragraph of this chapter, we might have a study of educational
achievement in which class-level effects are not considered exchangeable but rather depend 
on features of the school district or state. In a similar vein, the election forecasting example 
might be extended to attempt some modeling of the year-by-year errors in terms of trends 
over time, although there is probably limited information on which to base such a model 
after conditioning on other observed variables. No serious conceptual or computational 
difficulties are added by extending the model to more levels. 
A general formulation of a model with three levels of variation is: 


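$$
\begin{array}{lll}
y \mid X, \beta, \Sigma_y \sim \mathrm{N}(X\beta, \Sigma_y) & \text{‘likelihood’} & n \text{ data points } y_i \\
\beta \mid X_\beta, \alpha, \Sigma_\beta \sim \mathrm{N}(X_\beta\alpha, \Sigma_\beta) & \text{‘population distribution’} & J \text{ parameters } \beta_j \\
\alpha \mid \alpha_0, \Sigma_\alpha \sim \mathrm{N}(\alpha_0, \Sigma_\alpha) & \text{‘hyperprior distribution’} & K \text{ parameters } \alpha_k
\end{array}
$$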


Interpretation as a single linear regression 


The conjugacy of prior distribution and regression likelihood (see Section 14.8) allows us to 
express the hierarchical model as a single normal regression model using a larger ‘dataset’ 
that includes as ‘observations’ the information added by the population and hyperprior 
distributions. Specifically, for the three-level model, we can extend (14.24) to write 


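$$
y_* \mid X_*, \gamma, \Sigma_* \sim \mathrm{N}(X_*\gamma, \Sigma_*),
$$

where $\gamma$ is the vector $(\beta, \alpha)$ of length $J + K$, and $y_*$, $X_*$, and
$\Sigma_*^{-1}$ are defined by considering the likelihood, population, and hyperprior
distributions as $n + J + K$ ‘observations’ informative about $\gamma$:

$$
y_* = \begin{pmatrix} y \\ 0 \\ \alpha_0 \end{pmatrix}, \quad
X_* = \begin{pmatrix} X & 0 \\ I_J & -X_\beta \\ 0 & I_K \end{pmatrix}, \quad
\Sigma_*^{-1} = \begin{pmatrix} \Sigma_y^{-1} & 0 & 0 \\ 0 & \Sigma_\beta^{-1} & 0 \\ 0 & 0 & \Sigma_\alpha^{-1} \end{pmatrix}.
\qquad (15.3)
$$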


If any of the components of $\beta$ or $\alpha$ have noninformative prior distributions, the
corresponding rows in $y_*$ and $X_*$, as well as the corresponding rows and columns in
$\Sigma_*^{-1}$, can be eliminated, because they correspond to ‘observations’ with infinite
variance. The resulting regression then has $n + J_* + K_*$ ‘observations,’ where $J_*$ and
$K_*$ are the number of components of $\beta$ and $\alpha$, respectively, with informative
prior distributions.

For example, in the election forecasting example, $\beta$ has 75 components (20 predictors in
the original regression, including the constant term but excluding the five regional variables
in Table 15.1 associated with specific years; 11 year errors; and 44 region × year errors),
but $J_*$ is only 55 because only the year and region × year errors have informative prior
distributions. (All three groups of varying coefficients have means fixed at zero, so in this
example $K_* = 0$.)
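To make the augmented-data construction concrete, the following sketch builds the augmented response, predictor matrix, and precision matrix for the three-level model when all variance matrices are treated as known, and then computes the conditional posterior mean and variance of $\gamma = (\beta, \alpha)$ by weighted least squares. It assumes NumPy is available; the function names and the toy numbers are ours, not from the election example.

```python
import numpy as np

def augmented_regression(y, X, X_beta, alpha0, Sigma_y, Sigma_beta, Sigma_alpha):
    """Build (y_*, X_*, Sigma_*^{-1}) for the three-level normal model."""
    n, J = X.shape
    K = X_beta.shape[1]
    y_star = np.concatenate([y, np.zeros(J), alpha0])
    X_star = np.block([
        [X, np.zeros((n, K))],        # likelihood rows
        [np.eye(J), -X_beta],         # population-distribution rows: 0 = beta - X_beta alpha + error
        [np.zeros((K, J)), np.eye(K)],  # hyperprior rows: alpha0 = alpha + error
    ])
    prec_star = np.zeros((n + J + K, n + J + K))
    prec_star[:n, :n] = np.linalg.inv(Sigma_y)
    prec_star[n:n + J, n:n + J] = np.linalg.inv(Sigma_beta)
    prec_star[n + J:, n + J:] = np.linalg.inv(Sigma_alpha)
    return y_star, X_star, prec_star

def weighted_least_squares(y_star, X_star, prec_star):
    """Posterior mean and variance of gamma given the variance matrices."""
    V = np.linalg.inv(X_star.T @ prec_star @ X_star)
    return V @ (X_star.T @ prec_star @ y_star), V

# Toy case: y ~ N(beta, 1), beta ~ N(alpha, 1), alpha ~ N(0, 1), with a single observation y = 3.
one = np.eye(1)
y_s, X_s, P_s = augmented_regression(np.array([3.0]), one, one, np.array([0.0]), one, one, one)
gamma_mean, gamma_var = weighted_least_squares(y_s, X_s, P_s)  # means of (beta, alpha): 2 and 1
```

In the toy case the result agrees with the direct calculation: marginally $\beta \sim \mathrm{N}(0, 2)$, so the posterior mean of $\beta$ given $y = 3$ is $3 \cdot 2/3 = 2$.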


More than one way to set up a model 


A hierarchical regression model can be set up in several equivalent ways. For example, 
we have already noted that the hierarchical model for the 8 schools could be written as
$y_j \sim \mathrm{N}(\theta_j, \sigma_j^2)$ and $\theta_j \sim \mathrm{N}(\mu, \tau^2)$, or as
$y_j \sim \mathrm{N}(\beta_0 + \beta_j, \sigma_j^2)$ and $\beta_j \sim \mathrm{N}(0, \tau^2)$. The
hierarchical model for the election forecasting example can be written, as described above,
as a regression, $y \sim \mathrm{N}(X\beta, \sigma^2 I)$, with 75 predictors $X$ and normal prior
distributions on the coefficients $\beta$, or as a regression with 20 predictors and three
independent error terms, corresponding to year, region × year, and state × year.

In the three-level formulation described at the beginning of this section, group-level 
predictors can be included in either the likelihood or the population distribution, and the 
constant term can be included in any of the three regression levels. 


15.4 Varying intercepts and slopes


So far we have focused on hierarchical models for scalar parameters. What do we do when 
multiple parameters can vary by group? Then we need a multivariate prior distribution. 
Consider the following generic model of data in $J$ groups and, within each group $j$, a
likelihood, $p(y^{(j)} \mid \theta^{(j)})$, depending on a vector of parameters $\theta^{(j)}$, which
themselves are given a prior distribution, $p(\theta^{(j)} \mid \phi)$, given hyperparameters $\phi$.

The model can also have parameters that do not vary by group, for example in the 
varying-intercept, varying-slope linear model: 


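$$
\begin{aligned}
y_{ij} &\sim \mathrm{N}(\alpha_j + x_{ij}\beta_j, \sigma_y^2) \\
\begin{pmatrix} \alpha_j \\ \beta_j \end{pmatrix} &\sim
\mathrm{N}\left( \begin{pmatrix} \mu_\alpha \\ \mu_\beta \end{pmatrix},
\begin{pmatrix} \sigma_\alpha^2 & \rho\sigma_\alpha\sigma_\beta \\ \rho\sigma_\alpha\sigma_\beta & \sigma_\beta^2 \end{pmatrix} \right).
\end{aligned}
\qquad (15.4)
$$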


We then assign a hyperprior distribution to the vector of hyperparameters,
$(\mu_\alpha, \mu_\beta, \sigma_\alpha, \sigma_\beta, \rho)$, probably starting with a uniform
distribution (with the constraints that the scale parameters must be positive and the
correlation parameter is between −1 and 1) and then adding more information as necessary
(for example, if the number of groups is low).

When more than two coefficients vary by group, we can write (15.4) in vector-matrix 
form as, 


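$$
\begin{aligned}
y_{ij} &\sim \mathrm{N}(X_{ij}\beta^{(j)}, \sigma_y^2) \\
\beta^{(j)} &\sim \mathrm{N}(\mu_\beta, \Sigma_\beta).
\end{aligned}
\qquad (15.5)
$$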


The model could be further elaborated, for example by having the data variance itself vary 
by group, or by adding structure to the mean vector, or by having more than one level of 
grouping. In any case, the key is the prior distribution on $\Sigma_\beta$. (The data variance
$\sigma_y^2$ and the mean vector $\mu_\beta$ also need hyperprior distributions, but these
parameters can typically be estimated pretty well from the data, so their exact prior
specification is not so important.)


Inverse- Wishart model 


Let $K$ be the number of coefficients in the regression model, so that $\beta$ is a $J \times K$
matrix, and the group-level variance matrix $\Sigma_\beta$ is $K \times K$. The conditionally
conjugate distribution for $\Sigma_\beta$ in model (15.5) is the inverse-Wishart (see Section 3.6).


Scaled inverse- Wishart model 


The trouble with the $\text{Inv-Wishart}_{K+1}(I)$ model is that it strongly constrains the variance
parameters, the diagonal elements of the covariance matrix. What we want is a model
that is noninformative on the correlations but allows a wider range of uncertainty on the
variances. We can create such a model using a redundant parameterization, rewriting (15.5)
as

$$
\begin{aligned}
y_{ij} &\sim \mathrm{N}(X_{ij}(\mu_\beta + \xi \otimes \eta^{(j)}), \sigma_y^2) \\
\eta^{(j)} &\sim \mathrm{N}(0, \Sigma_\eta).
\end{aligned}
\qquad (15.6)
$$

Here, $\xi$ is a vector of length $K$, the symbol $\otimes$ represents component-wise
multiplication, we have decomposed the coefficient vector from each group as
$\beta^{(j)} = \mu_\beta + \xi \otimes \eta^{(j)}$, and the covariance matrix is

$$
\Sigma_\beta = \mathrm{Diag}(\xi)\, \Sigma_\eta\, \mathrm{Diag}(\xi).
$$
The advantage of splitting up the model in this way is that now we can assign separate
prior distributions to $\xi$ and $\Sigma_\eta$, inducing a richer structure of uncertainty modeling.
It is, in fact, an example of the scaled inverse-Wishart model described at the end of Section 3.6.
As a noninformative model, we can set $\Sigma_\eta \sim \text{Inv-Wishart}_{K+1}(I)$, along with
independent uniform prior distributions on the $\xi$’s. If necessary, one can add informative
priors to $\xi$.
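The effect of the scaling can be seen in a small numerical sketch. Given a vector $\xi$ and a matrix $\Sigma_\eta$ (both hypothetical values chosen for illustration, as are the function names), the implied covariance matrix has standard deviations multiplied by $|\xi_k|$, while for positive $\xi$ the correlations are those of $\Sigma_\eta$.

```python
import math

def scaled_covariance(xi, Sigma_eta):
    """Sigma_beta = Diag(xi) Sigma_eta Diag(xi), computed element by element."""
    K = len(xi)
    return [[xi[k] * Sigma_eta[k][l] * xi[l] for l in range(K)] for k in range(K)]

def correlation(Sigma, k, l):
    """Correlation implied by a covariance matrix."""
    return Sigma[k][l] / math.sqrt(Sigma[k][k] * Sigma[l][l])

# Hypothetical values: unit variances and correlation 0.3 in Sigma_eta, scales 2 and 0.5.
Sigma_eta = [[1.0, 0.3], [0.3, 1.0]]
Sigma_beta = scaled_covariance([2.0, 0.5], Sigma_eta)
# The standard deviations become 2 and 0.5; the correlation stays at 0.3.
```

This is why the prior on $\Sigma_\eta$ can be left to govern the correlations while the prior on $\xi$ controls the scale of each coefficient separately.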


Predicting business school grades for different groups of students 


We illustrate varying-intercept, varying-slope regression with an example of prediction in 
which the data are so sparse that a hierarchical model is necessary for inference about some 
of the estimands of interest. 

It is common for schools of business management in the United States to use regression 
equations to predict the first-year grade point average of prospective students from their 
scores on the verbal and quantitative portions of the Graduate Management Admission Test 
(GMAT-V and GMAT-Q) as well as their undergraduate grade point average (UGPA). This 
equation is important because the predicted score derived from it may play a central role in 
the decision to admit the student. The coefficients of the regression equation are typically 
estimated from the data collected from the most recently completed first-year class. 

A concern was raised with this regression model about possible biased predictions for 
identifiable subgroups of students, particularly black students. A study was performed based 
on data from 59 business schools over a two-year period, involving about 8500 students of 
whom approximately 4% were black. For each school, a separate regression was performed 
of first-year grades on four explanatory variables: a constant term, GMAT-V, GMAT-Q, 
and UGPA. By looking at the residuals for all schools and years, it was found that the 
regressions tended to overpredict the first-year grades of blacks. 

At this point, it might seem natural to add another term to the regression model corre- 
sponding to an indicator variable that is 1 if a student is black and 0 otherwise. However, 
such a model was considered too restrictive; once blacks and non-blacks were treated sepa- 
rately in the model, it was desired to allow different regression models for the two groups. 
For each school, the expanded model then has eight explanatory variables: the four men- 
tioned above, and then the same four variables multiplied by the indicator for black students. 
For student $i = 1, \ldots, n_j$ in school $j = 1, \ldots, 59$ we model the first-year grade point
average $y_{ij}$, given the vector of eight covariates $x_{ij}$, as a linear regression with
coefficient vector $\beta_j$ and residual variance $\sigma_j^2$. Then the model for the entire
vector of responses $y$ is

$$
p(y \mid \beta, \sigma^2) = \prod_{j=1}^{59} \prod_{i=1}^{n_j} \mathrm{N}(y_{ij} \mid X_{ij}\beta_j, \sigma_j^2).
$$


Geometrically, the model is equivalent to requiring two different regression planes: one 
for blacks and one for non-blacks. For each school, nine parameters must be estimated: 
$\beta_{1j}, \ldots, \beta_{8j}, \sigma_j$. Algebraically, all eight terms of the regression are used to predict the
scores of blacks but only the first four terms for non-blacks. 

At this point, the procedure of estimating separate regressions for each school becomes 
impossible using standard least-squares methods, which are implicitly based on noninfor- 
mative prior distributions. Blacks comprise only 4% of the students in the dataset, and 
many of the schools are all non-black or have so few blacks that the regression parameters 
cannot be estimated under classical regression (that is, based on a noninformative prior dis- 
tribution on the nine parameters in each regression). Fortunately, it is possible to estimate 
all 8 × 59 regression parameters simultaneously using a hierarchical model. To use the most
straightforward approach, the 59 vectors $\beta_j$ are modeled as independent samples from a
multivariate $\mathrm{N}(\alpha, \Lambda_\beta)$ distribution, with unknown vector $\alpha$ and
8 × 8 matrix $\Lambda_\beta$.

The unknown parameters of the model are then $\beta$, $\alpha$, $\Lambda_\beta$, and
$\sigma_1, \ldots, \sigma_{59}$. We first assumed a uniform prior distribution on
$\alpha, \Lambda_\beta, \log\sigma_1, \ldots, \log\sigma_{59}$. This noninformative approach is not
ideal (at the least, one would want to embed the 59 $\sigma_j$ parameters in a
hierarchical model) but is a reasonable start.

A crude approximation to $\alpha$ and the parameters $\sigma_j^2$ was obtained by running a
regression of the combined data vector $y$ on the eight explanatory variables, pooling the data
from all 59 schools. Using the crude estimates as a starting point, the posterior mode of
$(\alpha, \Lambda_\beta, \sigma_1^2, \ldots, \sigma_{59}^2)$ was found using EM. Here, we describe the
conclusions of the study, which were based on this modal approximation.

One conclusion from the analysis was that the multivariate hierarchical model is a sub- 
stantial improvement over the standard model, because the predictions for both black and 
non-black students are relatively accurate. Moreover, the analysis revealed systematic dif- 
ferences between predictions for black and non-black students. In particular, conditioning 
the test scores at the mean scores for the black students, in about 85% of schools, non- 
blacks were predicted to have higher first-year grade-point averages, with over 60% of the 
differences being more than one posterior standard deviation above zero, and about 20% 
being more than two posterior standard deviations above zero. This sort of comparison, 
conditional on school and test scores, could not be reasonably estimated with a nonhier- 
archical model in this dataset, in which the number of black students per school was so 
low. 


15.5 Computation: batching and transformation 


There are several ways to use the Gibbs sampler to draw posterior simulations for
hierarchical linear models. The methods differ in the programming effort they require, in
computational speed, and in convergence rate, with different approaches being reasonable for
different problems. In general we prefer Hamiltonian Monte Carlo to simple Gibbs; how- 
ever, ideas of Gibbs sampling remain relevant, both for parameterizing the problem so that 
HMC runs more effectively, and because for large problems it can make sense to break up 
a hierarchical model into smaller pieces for more efficient computation. 

We shall discuss computation for models with independent variances at each of the hier- 
archical levels; computation can be adapted to structured covariance matrices as described 
in Section 14.7. 


Gibbs sampler, one batch at a time 


Perhaps the simplest computational approach for hierarchical regressions is to perform a 
blockwise Gibbs sampler, updating each batch of regression coefficients given all the others, 
and then updating the variance parameters. Given the data and all other parameters in 
the model, inference for a batch of coefficients corresponds to a linear regression with fixed 
prior distribution. We can thus update the coefficients of the entire model in batches, 
performing at each step an augmented linear regression, as discussed in Section 14.8. In 
many cases (including the statewide, regional, and national error terms in the election 
forecasting example), the model is simple enough that the means and variances for the Gibbs 
updating do not actually require a regression calculation but instead can be performed using 
simple averages. 

The variance parameters can also be updated one at a time. For simple hierarchical 
models with scalar variance parameters (typically one parameter per batch of coefficients, 
along with one or more data-level variance parameters), the Gibbs updating distributions are
scaled inverse-$\chi^2$. For more complicated models, the variance parameters can be updated
using Metropolis jumps. 
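For the simple case, the variance update can be sketched as follows. The sketch assumes a batch of coefficients with mean zero and a scaled inverse-$\chi^2(\nu_0, s_0^2)$ prior on the batch variance; the defaults $\nu_0 = -1$, $s_0^2 = 0$ correspond to a uniform prior density on the standard deviation. The function names are our own, and the draw uses the fact that a $\chi^2_\nu$ variable is a gamma variable with shape $\nu/2$ and scale 2.

```python
import random

def draw_scaled_inv_chi2(nu, s2, rng):
    """Draw from the scaled inverse-chi^2(nu, s2) distribution as nu * s2 / chi2_nu."""
    chi2 = rng.gammavariate(nu / 2.0, 2.0)  # chi^2 with nu degrees of freedom
    return nu * s2 / chi2

def update_batch_variance(coefs, rng, nu0=-1.0, s0_sq=0.0):
    """Gibbs update for the variance of a batch of zero-mean coefficients."""
    nu_post = nu0 + len(coefs)
    if nu_post <= 0:
        raise ValueError("need at least two coefficients under the default prior")
    s2_post = (nu0 * s0_sq + sum(c * c for c in coefs)) / nu_post
    return draw_scaled_inv_chi2(nu_post, s2_post, rng)

# Hypothetical current values of four coefficients in one batch.
new_variance = update_batch_variance([0.1, -0.2, 0.05, 0.3], random.Random(0))
```

When the batch contains few coefficients, the posterior degrees of freedom are small and the draws of the variance are correspondingly dispersed.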

The Gibbs sampler for the entire model then proceeds by cycling through all the batches 
of parameters in the model (including the batch, if any, of coefficients with noninformative 
or fixed prior distributions). One attractive feature of this algorithm is that it mimics
the natural way one might try to combine information from the different levels: starting 
with a guess at the upper-level parameters, the lower-level regression is run, and then these 
simulated coefficients are used to better estimate the upper-level regression parameters, and 
so on. 


All-at-once Gibbs sampler 


As noted in Section 15.3, the different levels in a hierarchical regression context can be 
combined into a single regression model by appropriately augmenting the data, predictors, 
and variance matrix. The Gibbs sampler can then be applied, alternately updating the variance
parameters (with independent inverse-$\chi^2$ distributions given the data and regression
coefficients), and the vector of regression coefficients, which can be updated at each step by 
running a weighted regression with weight matrix depending on the current values of the 
variance parameters. 

In addition to its conceptual simplicity, this all-at-once Gibbs sampler has the advantage
of working efficiently even if regression coefficients at different levels are correlated in their
posterior distribution, as can commonly happen with hierarchical models (for example, the
parameters $\theta_j$ and $\mu$ have positive posterior correlations in the 8-schools example).

The main computational disadvantage of all-at-once Gibbs sampling for this problem is 
that each step requires a regression on a potentially large augmented dataset. For example, 
the election forecasting model has $n = 511$ data points and $k = 20$ predictors, but the
augmented regression has $n_* = 511 + 55$ observations and $k_* = 20 + 55$ predictors. The
computer time required to perform a linear regression is proportional to $nk^2$, and thus each
step of the all-at-once Gibbs sampler can be slow when the number of parameters is large.
In practice we would implement such a computation using HMC rather than Gibbs. 
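A quick calculation shows the size of the penalty in the election example, taking the cost of a regression as proportional to the number of observations times the square of the number of predictors (the function name is ours):

```python
def regression_cost(n, k):
    """Relative cost of one least-squares fit, taken as proportional to n * k**2."""
    return n * k ** 2

original = regression_cost(511, 20)              # 511 * 400 = 204,400
augmented = regression_cost(511 + 55, 20 + 55)   # 566 * 5,625 = 3,183,750
ratio = augmented / original                     # about 15.6
```

Each all-at-once update is thus roughly 16 times as expensive as a regression on the original 20 predictors, almost entirely because of the added predictors rather than the added ‘observations.’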


Parameter expansion 


Any of the algorithms mentioned above can be slow to converge when estimated hierarchical 
variance parameters are near zero. The problem is that, if the current draw of a hierarchical 
variance parameter is near 0, then in the next updating step, the corresponding batch of 
linear parameters $\gamma_j$ will themselves be ‘shrunk’ to be close to their population mean. Then,
in turn, the variance parameter will be estimated to be close to 0 because it is updated based
on the relevant $\gamma_j$’s. Ultimately, the stochastic nature of the Gibbs sampler allows it to
escape out of this trap but this may require many iterations. 

The parameter-expanded Gibbs sampler and EM algorithms (see Sections 12.1 and 13.4) 
can be used to solve this problem. For hierarchical linear models, the basic idea is to 
associate with each batch of regression coefficients a multiplicative factor, which has the 
role of breaking the dependence between the coefficients and their variance parameter. 


Example. Election forecasting (continued) 
We illustrate with the presidential election forecasting model (15.2), which in its 
expanded-parameter version can be written as, 


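$$
y_{st} \sim
\begin{cases}
\mathrm{N}(X_{st}\beta + \zeta_1^{\mathrm{region}}\gamma_{r(s)t} + \zeta^{\mathrm{year}}\delta_t, \sigma^2) & \text{if } r(s) = 1, 2, 3 \text{ (non-South)} \\
\mathrm{N}(X_{st}\beta + \zeta_2^{\mathrm{region}}\gamma_{r(s)t} + \zeta^{\mathrm{year}}\delta_t, \sigma^2) & \text{if } r(s) = 4 \text{ (South)},
\end{cases}
$$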


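with the same prior distributions as before. The new parameters $\zeta_1^{\mathrm{region}}$,
$\zeta_2^{\mathrm{region}}$, and $\zeta^{\mathrm{year}}$ are assigned uniform prior distributions
and are not identified in the posterior distribution. The products
$\zeta_1^{\mathrm{region}}\gamma_{rt}$ (for $r = 1, 2, 3$), $\zeta_2^{\mathrm{region}}\gamma_{rt}$
(for $r = 4$), and $\zeta^{\mathrm{year}}\delta_t$ correspond to the parameters $\gamma_{rt}$ and
$\delta_t$, respectively, in the old model. Similarly, the products
$|\zeta_1^{\mathrm{region}}|\tau_{\gamma 1}$, $|\zeta_2^{\mathrm{region}}|\tau_{\gamma 2}$, and
$|\zeta^{\mathrm{year}}|\tau_\delta$ correspond to the variance components $\tau_{\gamma 1}$,
$\tau_{\gamma 2}$, and $\tau_\delta$ in the original model.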

For the election forecasting model, the variance parameters are estimated precisely 
enough that the ordinary Gibbs sampler performs reasonably efficiently. However, we 
can use this problem to illustrate the computational steps of parameter expansion. 
The Gibbs sampler for the parameter-expanded model alternately updates the regression
coefficients $(\beta, \gamma, \delta)$, the variance parameters
$(\sigma, \tau_{\gamma 1}, \tau_{\gamma 2}, \tau_\delta)$, and the additional parameters
$(\zeta_1^{\mathrm{region}}, \zeta_2^{\mathrm{region}}, \zeta^{\mathrm{year}})$. The regression
coefficients can be updated using any of the Gibbs samplers described above: by batch, all at
once, or one element at a time. The $\zeta$ parameters do not change the fact that $\beta$,
$\gamma$, and $\delta$ can be estimated by linear regression. Similarly, the Gibbs sampler
updates for the variances are still independent inverse-$\chi^2$.

The final step of the Gibbs sampler is to update the multiplicative parameters $\zeta$. This
step turns out to be easy: given the data and the other parameters in the model, the
information about the $\zeta$’s can be expressed simply as a linear regression of the ‘data’
$z_{st} = y_{st} - X_{st}\beta$ on the ‘predictors’ $\gamma_{r(s)t}$ (for states $s$ with regions
$r(s) = 1, 2,$ or 3), $\gamma_{r(s)t}$ (for $r(s) = 4$), and $\delta_t$, with variance matrix
$\sigma^2 I$. The three parameters $\zeta$ are then easily updated with a linear regression.

When running the Gibbs sampler, we do not worry about the individual parameters
$\zeta, \beta, \gamma, \delta$; instead, we save and monitor the convergence of the variance
parameters $\sigma$ and the parameters $\gamma_{rt}$ and $\delta_t$ in the original
parameterization (15.2). This is most easily done by just multiplying each of the parameters
$\gamma_{rt}$ and $\delta_t$ by the appropriate $\zeta$ parameter, and multiplying each of the
variance components $\tau_{\gamma 1}, \tau_{\gamma 2}, \tau_\delta$ by the
absolute value of its corresponding $\zeta$.
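The bookkeeping in this last step is simple enough to show directly. The sketch below (function name and numbers are ours, for illustration) maps one batch from the expanded parameterization back to the original one, and also shows why the expanded parameters are not separately identified: rescaling the multiplier and the batch in opposite directions leaves the original-scale quantities unchanged.

```python
import math

def to_original_scale(zeta, coefs, tau):
    """Map one batch from the expanded to the original parameterization:
    coefficients are multiplied by zeta, the standard deviation by |zeta|."""
    return [zeta * c for c in coefs], abs(zeta) * tau

# Hypothetical current draw of one batch in the expanded model.
coefs, tau = to_original_scale(-2.0, [0.5, -1.0], 0.3)

# Multiplying zeta by 4 and dividing the batch and its scale by 4 gives the same result.
coefs_alt, tau_alt = to_original_scale(-8.0, [0.125, -0.25], 0.075)
```

Only the products are meaningful, which is why convergence is monitored on the original scale rather than for the expanded parameters individually.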


More on the parameter-expanded Gibbs sampler for hierarchical models appears at the end 
of Section 15.7. 


Transformations for HMC 


A slightly different transformation can be useful when implementing Hamiltonian Monte 
Carlo. HMC can be slow to converge when certain parameters have very short-tailed or long- 
tailed distributions, and this can happen with the variance parameters or the log variance 
parameters in hierarchical models. A simple decoupling of the model can sometimes solve 
this problem. 


Example. Eight schools model 
We demonstrate with the hierarchical model for the educational testing experiments 
from Section 5.5: 

$$
\begin{aligned}
y_j &\sim \mathrm{N}(\theta_j, \sigma_j^2), \quad \text{for } j = 1, \ldots, J \\
\theta_j &\sim \mathrm{N}(\mu, \tau^2), \quad \text{for } j = 1, \ldots, J,
\end{aligned}
$$

along with some prior distribution, $p(\mu, \tau)$. EM or Gibbs for this model can get stuck
when the value of $\tau$ is near zero. In HMC, the joint posterior distribution of
$\theta, \mu, \tau$ forms a ‘whirlpool’: no single step size works well for the whole
distribution, and, again, the trajectories are unlikely to go into the region where $\tau$ is
near zero and then unlikely to leave that vortex when they are there.

The following parameterization works much better: 


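$$
\begin{aligned}
y_j &\sim \mathrm{N}(\mu + \tau\eta_j, \sigma_j^2), \quad \text{for } j = 1, \ldots, J \\
\eta_j &\sim \mathrm{N}(0, 1), \quad \text{for } j = 1, \ldots, J,
\end{aligned}
$$

where $\theta_j = \mu + \tau\eta_j$ for each $j$.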
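As a sketch of what an HMC program would work with under this parameterization, the function below evaluates the unnormalized log posterior density in terms of $(\mu, \tau, \eta)$. It assumes, for illustration only, a uniform prior density on $(\mu, \tau)$ with $\tau > 0$; the function names are ours. The point is that the $\eta_j$ have a fixed unit-scale prior regardless of $\tau$, so the geometry no longer narrows as $\tau$ approaches zero.

```python
def theta_from_eta(mu, tau, eta):
    """Recover the school effects: theta_j = mu + tau * eta_j."""
    return [mu + tau * e for e in eta]

def log_posterior_noncentered(mu, tau, eta, y, sigma):
    """Unnormalized log posterior in (mu, tau, eta), assuming a uniform
    prior density on (mu, tau) with tau > 0 (an assumption of this sketch)."""
    if tau <= 0:
        return float("-inf")
    theta = theta_from_eta(mu, tau, eta)
    log_lik = sum(-0.5 * ((yj - tj) / sj) ** 2 for yj, tj, sj in zip(y, theta, sigma))
    log_prior_eta = sum(-0.5 * e * e for e in eta)  # eta_j ~ N(0, 1)
    return log_lik + log_prior_eta
```

After sampling $(\mu, \tau, \eta)$, the quantities of interest $\theta_j$ are obtained draw by draw with `theta_from_eta`.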


15.6 Analysis of variance and the batching of coefficients


The largest gains in estimating regression coefficients often come from specifying structure 
in the model. For example, in the election forecasting problem, it is crucial that the national 
and regional indicator variables are clustered and modeled separately from the quantitative 
predictors. In general, when many predictor variables are used in a regression, they should 
be set up so they can be structured hierarchically, so the Bayesian analysis can do the most 
effective job at pooling the information about them. 

Analysis of variance (Anova) represents a key idea in statistical modeling of complex 
data structures—the grouping of predictor variables and their coefficients into batches. 
In the traditional application of analysis of variance, each batch of coefficients and the 
associated row of the Anova table corresponds to a single experimental block or factor or to 
an interaction of two or more factors. In a hierarchical linear regression context, each row 
of the table corresponds to a set of regression coefficients, and we are potentially interested 
in the individual coefficients and also in the variance of the coefficients in each batch. We 
thus view the analysis of variance as a way of making sense of hierarchical regression models 
with many predictors or indicators that can be grouped into batches within which all the 
coefficients are exchangeable. 


Notation and model 


We shall work with the linear model, with the ‘analysis of variance’ corresponding to the
batching of coefficients into ‘sources of variation,’ with each batch corresponding to one row
of an Anova table. We use the notation $m = 1, \ldots, M$ for the rows of the table. Each row
$m$ represents a batch of $J_m$ regression coefficients $\beta_j^{(m)}$, $j = 1, \ldots, J_m$. We
denote the $m$-th subvector of coefficients as
$\beta^{(m)} = (\beta_1^{(m)}, \ldots, \beta_{J_m}^{(m)})$ and the corresponding classical
least-squares estimate as $\hat\beta^{(m)}$. These estimates are subject to $c_m$ linear
constraints, yielding $(\mathrm{df})_m = J_m - c_m$ degrees of freedom. We label the
$c_m \times J_m$ constraint matrix as $C^{(m)}$, so that $C^{(m)}\hat\beta^{(m)} = 0$ for all $m$,
and we assume that $C^{(m)}$ is of rank $c_m$. For notational convenience, we label the grand
mean as $\beta_1^{(0)}$, corresponding to the (invisible) zeroth row of the Anova table and
estimated with no linear constraints.


The linear model is fitted to the data points $y_i$, $i = 1, \ldots, n$, and can be written as

$$
y_i = \sum_{m=0}^{M} \beta^{(m)}_{j_i^m}, \qquad (15.7)
$$

where $j_i^m$ indexes the appropriate coefficient $j$ in batch $m$ corresponding to data point $i$.
Thus, each data point pulls one coefficient from each row in the Anova table. Equation
(15.7) could also be expressed as a linear regression model with a design matrix composed
entirely of 0’s and 1’s. The coefficients $\beta_j^{(M)}$ of the last row of the table correspond
to the residuals or error term of the model.

The analysis of variance can also be applied more generally to regression models (or to
generalized linear models), in which case we can have any design matrix $X$, and (15.7) is
replaced by

$$
y_i = \sum_{m=0}^{M} \sum_{j=1}^{J_m} x_{ij}^{(m)} \beta_j^{(m)}. \qquad (15.8)
$$

The essence of analysis of variance is in the structuring of the coefficients into batches
(hence the notation $\beta_j^{(m)}$), going beyond the usual linear model formulation that has a
single indexing of coefficients $\beta_j$.
We shall use a hierarchical formulation in which each batch of regression coefficients is
modeled as a sample from a normal distribution with mean 0 and its own variance $\sigma_m^2$:

$$
\beta_j^{(m)} \sim \mathrm{N}(0, \sigma_m^2), \quad \text{for } j = 1, \ldots, J_m, \text{ for each batch } m = 1, \ldots, M.
$$

Without loss of generality, we can center the distribution of each batch $\beta^{(m)}$ at 0; if
it were appropriate for the mean to be elsewhere, we would just include the mean of the
$\beta_j^{(m)}$’s as a separate predictor. As in classical Anova, we usually include interactions
only if the corresponding main effects are in the model.
The conjugate hyperprior distributions for the variances are scaled inverse-$\chi^2$
distributions:

$$
\sigma_m^2 \sim \text{Inv-}\chi^2(\nu_m, \sigma_{0m}^2).
$$

A natural noninformative prior density is uniform on $\sigma_m$, which corresponds to
$\nu_m = -1$ and $\sigma_{0m} = 0$. For values of $m$ in which $J_m$ is large (that is, rows of
the Anova table corresponding to many linear predictors), $\sigma_m$ is essentially estimated
from data. When $J_m$ is small, the flat prior distribution implies that $\sigma_m$ is allowed the
possibility of taking on large values, which minimizes the amount of shrinkage in the
coefficient estimates.


Computation 


In this model, the posterior distribution for the parameters $(\beta, \sigma)$ can be simulated
using the Gibbs sampler, alternately updating the vector $\beta$ given $\sigma$ with linear
regression, and updating the vector $\sigma$ from the independent inverse-$\chi^2$ conditional
posterior distributions given $\beta$.

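The only trouble with this Gibbs sampler is that it can get stuck with variance components
$\sigma_m$ near zero. A more efficient updating uses parameter expansion, as described at
the end of Section 15.5. In the notation here, we reparameterize into vectors $\gamma$,
$\zeta$, and $\tau$, which are defined as follows:

$$
\begin{aligned}
\beta_j^{(m)} &= \zeta_m \gamma_j^{(m)} \\
\sigma_m &= |\zeta_m| \tau_m.
\end{aligned}
\qquad (15.9)
$$

The model can then be expressed as

$$
\begin{aligned}
y &= X\zeta\gamma \\
\gamma_j^{(m)} &\sim \mathrm{N}(0, \tau_m^2) \quad \text{for each } m \\
\tau_m^2 &\sim \text{Inv-}\chi^2(\nu_m, \sigma_{0m}^2),
\end{aligned}
$$

where $\zeta\gamma$ denotes the batchwise products $\zeta_m\gamma_j^{(m)}$.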


The auxiliary parameters $\zeta$ are given a uniform prior distribution, and then this reduces to
the original model. The Gibbs sampler then proceeds by updating $\gamma$ (using linear regression
with $n$ data points and $\sum_{m=0}^{M} J_m$ predictors), $\zeta$ (linear regression with $n$ data
points and $M$ predictors), and $\tau$ (independent inverse-$\chi^2$ distributions). The parameters
in the original parameterization, $\beta$ and $\sigma$, can then be recomputed from (15.9) and
stored at each step.


Finite-population and superpopulation standard deviations 


One measure of the importance of each row or ‘source’ in the Anova table is the standard
deviation of its constrained regression coefficients, defined as

$$
s_m = \sqrt{ \frac{1}{(\mathrm{df})_m}\, \beta^{(m)T} \left[ I - C^{(m)T} \left( C^{(m)} C^{(m)T} \right)^{-1} C^{(m)} \right] \beta^{(m)} }. \qquad (15.10)
$$

We divide by $(\mathrm{df})_m = J_m - c_m$ rather than $J_m - 1$ because multiplying by $C^{(m)}$
induces $c_m$ linear constraints. We model the underlying $\beta$ coefficients as unconstrained.

For each batch of coefficients $\beta^{(m)}$, there are two natural variance parameters to
estimate: the superpopulation standard deviation $\sigma_m$ and the finite-population standard
deviation $s_m$ as defined in (15.10). The superpopulation standard deviation characterizes the
uncertainty for predicting a new coefficient from batch $m$, whereas the finite-population
standard deviation describes the variation in the existing $J_m$ coefficients.

Variance estimation is often presented in terms of the superpopulation standard deviations
$\sigma_m$, but in our Anova summaries, we focus on the finite-population quantities $s_m$, for
reasons we shall discuss here. However, for computational reasons, the parameters $\sigma_m$ are
useful intermediate quantities to estimate. Our general procedure is to use computational
methods such as described in Section 15.5 to draw joint posterior simulations of
$(\beta, \sigma)$ and then compute the finite-sample standard deviations $s_m$ from $\beta$
using (15.10).

To see the difference between the two variances, consider the extreme case in which
$J_m = 2$ (with the usual constraint that $\beta_1^{(m)} + \beta_2^{(m)} = 0$) and a large amount
of data are available in both groups. Then the two parameters $\beta_1^{(m)}$ and $\beta_2^{(m)}$
will be estimated accurately and so will
$s_m^2 = \frac{1}{2}(\beta_1^{(m)} - \beta_2^{(m)})^2$. The superpopulation variance $\sigma_m^2$,
on the other hand, is only being estimated by a measurement that is proportional to a $\chi^2$
with 1 degree of freedom. We know much about the two parameters $\beta_1^{(m)}$, $\beta_2^{(m)}$
but can say little about others from their batch.
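For the common case of a single sum-to-zero constraint, the finite-population standard deviation reduces to the ordinary sample standard deviation of the coefficients with divisor $J_m - 1$. The sketch below makes that specialization explicit (it is our assumption that the constraint matrix is a single row of ones, and the function name is ours) and can be applied to each posterior simulation draw of a batch.

```python
import math

def finite_population_sd(coefs):
    """Finite-population sd of one batch under a single sum-to-zero constraint:
    deviations from the batch mean, divided by J - 1 degrees of freedom."""
    J = len(coefs)
    mean = sum(coefs) / J
    return math.sqrt(sum((b - mean) ** 2 for b in coefs) / (J - 1))

# With two coefficients, the squared value equals half the squared difference.
s_two = finite_population_sd([0.3, -0.3])

# Applied draw by draw (hypothetical posterior draws of a three-coefficient batch).
draws = [[0.1, -0.3, 0.2], [0.0, -0.2, 0.2], [0.2, -0.4, 0.2]]
s_draws = [finite_population_sd(d) for d in draws]
```

Summarizing `s_draws` by its median and quantiles gives the kind of interval for $s_m$ that an Anova display reports.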

We believe that much of the statistical literature on fixed and random effects can be 
fruitfully reexpressed in terms of finite-population and superpopulation inferences. In some 
contexts (for example, collecting data on the 50 states of the U.S.), the finite population 
seems more meaningful; whereas in others (for example, subject-level effects in a psycho- 
logical experiment), interest clearly lies in the superpopulation. 

For example, suppose a factor has four degrees of freedom corresponding to five different 
medical treatments, and these are the only existing treatments and are thus considered 
‘fixed.’ Suppose it is then discovered that these are part of a larger family of many possible 
treatments, and so it makes sense to model them as ‘random.’ In our framework, the 
inference about these five parameters $\beta_j^{(m)}$ and their finite-population and superpopulation
standard deviations, $s_m$ and $\sigma_m$, will not change with the news that they can actually
be viewed as a random sample from a distribution of possible treatment effects. But the 
superpopulation variance now has an important new role in characterizing this distribution. 
The difference between fixed and random effects is thus not a difference in inference or 
computation but in the ways that these inferences will be used. 


Example. Five-way factorial structure for data on Web connect times 

We illustrate the analysis of variance with an example of a linear model fitted for 
exploratory purposes to a highly structured dataset. Data were collected by an inter- 
net infrastructure provider on connect times for messages processed by two different 
companies. Messages were sent every hour for 25 consecutive hours, from each of 45 
locations to 4 different destinations, and the study was repeated one week later. It 
was desired to quickly summarize these data to learn about the importance of different 
sources of variation in connect times. 

Figure 15.4 shows the Bayesian Anova display for an analysis of logarithms of connect 
times on the five factors: destination (‘to’), source (‘from’), service provider (‘com- 
pany’), time of day (‘hour’), and week. The data have a full factorial structure with 
no replication, so the full five-way interaction at the bottom represents the ‘error’ or 
lowest-level variability. 

Figure 15.4 Anova display for the World Wide Web data. The bars indicate 50% and 95% intervals
for the finite-population standard deviations $s_m$. The display makes apparent the magnitudes and
uncertainties of the different components of variation. Since the data are on the logarithmic scale,
the standard deviation parameters can be interpreted directly. For example, $s_m = 0.20$ corresponds
to a coefficient of variation of $\exp(0.2) - 1 \approx 0.2$ on the original scale, and so the exponentiated
coefficients $\exp(\beta_j^{(m)})$ in this batch correspond to multiplicative increases or decreases in the range
of 20%. (The dots on the bars show simple classical estimates of the variance components that can
be used as starting points in a Bayesian computation.) [Horizontal axis: estimated sd of effects.]

Each row of the plot shows the estimated finite-population standard deviation of the
corresponding group of parameters, along with 50% and 95% uncertainty intervals. We
can immediately see that the lowest-level variation is more important in variance than
any of the factors except for the main effect of the destination. Company represents a
large amount of variation on its own and, perhaps more interestingly, in interaction
with to, from, and in the three-way interaction.

Figure 15.4 would not normally represent the final statistical analysis for this sort of 
problem. The Anova plot represents a default model and is a tool for data exploration— 
for learning about which factors are important in predicting the variation in the data— 
and can be used to construct more focused models or design future data collection. 


15.7 Hierarchical models for batches of variance components 


We next consider an analysis of variance problem which has several variance components,
one for each source of variation, in a 5 × 5 × 2 split-plot latin square with five full-plot
treatments (labeled A, B, C, D, E), and with each plot divided into two subplots (labeled
1 and 2).

Figure 15.5 Posterior medians, 50%, and 95% intervals for standard deviation parameters $\sigma_k$
estimated from a split-plot latin square experiment. (a) The left plot shows inferences given uniform
prior distributions on the $\sigma_k$’s. (b) The right plot shows inferences given a hierarchical half-Cauchy
model with scale fit to the data. The half-Cauchy model gives much sharper inferences, using the
partial pooling that comes with fitting a hierarchical model.


Source                    df
row                        4
column                     4
(A,B,C,D,E)                4
plot                      12
(1,2)                      1
row × (1,2)                4
column × (1,2)             4
(A,B,C,D,E) × (1,2)        4
plot × (1,2)              12

Each row of the table corresponds to a different variance component, and the split-plot
Anova can be understood as a linear model with nine variance components, $\sigma_1^2, \ldots, \sigma_9^2$, one
for each row of the table. A simple noninformative analysis uses a uniform prior distribution,
$p(\sigma_1, \ldots, \sigma_9) \propto 1$.


More generally, we can set up a hierarchical model, where the variance parameters have a 
common distribution with hyperparameters estimated from the data. Based on the analyses 
given above, we consider a half-Cauchy prior distribution with peak 0 and scale A, and with 
a uniform prior distribution on A. The hierarchical half-Cauchy model allows most of the 
variance parameters to be small but with the occasionally large $\sigma_\alpha$, which seems reasonable
in the typical settings of analysis of variance, in which most sources of variation are small 
but some are large. 
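
As a rough illustration of why a half-Cauchy family allows the occasional large standard deviation, the sketch below evaluates a half-Cauchy density with peak 0 and scale A and compares its tail with a half-normal density of the same scale. The function names and the value A = 1 are illustrative assumptions; in the analysis above, A is given a uniform prior distribution and estimated from the data.

```python
import math

def half_cauchy_pdf(sigma, A):
    """Density of a half-Cauchy with peak 0 and scale A, for sigma >= 0."""
    if sigma < 0:
        return 0.0
    return 2.0 / (math.pi * A * (1.0 + (sigma / A) ** 2))

def half_normal_pdf(sigma, A):
    """Density of a half-normal with scale A, shown only for comparison."""
    if sigma < 0:
        return 0.0
    return math.sqrt(2.0 / math.pi) / A * math.exp(-0.5 * (sigma / A) ** 2)

A = 1.0  # illustrative scale, not a value from the text
for sigma in (0.0, 1.0, 5.0, 20.0):
    # The half-Cauchy keeps non-negligible density far from zero,
    # while the half-normal density collapses quickly.
    print(sigma, half_cauchy_pdf(sigma, A), half_normal_pdf(sigma, A))
```

Half of the half-Cauchy mass lies below the scale A, so most variance parameters are pulled toward small values, while the polynomial tail leaves room for a few large ones.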


Superpopulation and finite-population standard deviations 


Figure 15.5 shows the inferences in the latin square example, given uniform and hierarchical 
half-Cauchy prior distributions for the standard deviation parameters $\sigma_\alpha$. As the left plot
shows, the uniform prior distribution does not rule out the potential for some extremely high 
values of the variance components—the degrees of freedom are low, and the interlocking 
of the linear parameters in the latin square model results in difficulty in estimating any 
single variance parameter. In contrast, the hierarchical half-Cauchy model performs a great 
deal of shrinkage, especially of the high ranges of the intervals. (For most of the variance 
parameters, the posterior medians are similar under the two models; it is the 75th and 
97.5th percentiles that are shrunk by the hierarchical model.) This is an ideal setting 
for hierarchical modeling of variance parameters in that it combines separately imprecise 
estimates of each of the individual $\sigma_k$’s.

Figure 15.6 Posterior medians, 50%, and 95% intervals for finite-population standard deviations
$s_k$ estimated from a split-plot latin square experiment. (a) The left plot shows inferences given
uniform prior distributions on the $\sigma_k$’s. (b) The right plot shows inferences given a hierarchical
half-Cauchy model with scale fit to the data. The half-Cauchy model gives sharper estimates even
for these finite-population standard deviations, indicating the power of hierarchical modeling for
these highly uncertain quantities. Compare to Figure 15.5 (which is on a different scale).
[Horizontal axis: estimated finite-population sd’s.]


The $\sigma_k$’s are superpopulation parameters in that each represents the standard deviation
for an entire population of groups, of which only a few were sampled for the experiment at
hand. In estimating variance parameters from few degrees of freedom, it can be helpful also
to look at the finite-population standard deviation $s_\alpha$ of the corresponding linear parameters
$\alpha_j$.

For a simple hierarchical model of the form (5.21), $s_\alpha$ is simply the standard deviation
of the J values of $\alpha_j$. More generally, for more complicated linear models such as the
split-plot latin square, $s_\alpha$ for any variance component is the root mean square of the
coefficients’ residuals after projection to their constraint space. In any case, this finite-population
standard deviation s can be calculated from its posterior simulations and, especially when
degrees of freedom are low, is more precisely estimated than the superpopulation standard
deviation $\sigma$.
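
A minimal sketch of this calculation for the simple hierarchical case, assuming posterior simulations of the J coefficients are already available: compute the standard deviation of the J values within each simulation draw, then summarize the resulting draws. The simulated `draws`, the helper names, and the nearest-rank quantile rule are all hypothetical stand-ins, not output from the split-plot analysis.

```python
import random
import statistics

random.seed(1)

def finite_population_sd(alpha):
    """Standard deviation of the J coefficient values in one posterior draw."""
    return statistics.stdev(alpha)  # divides by J - 1

def quantile(sorted_values, p):
    """Empirical quantile by nearest rank on an already sorted list."""
    index = round(p * (len(sorted_values) - 1))
    return sorted_values[index]

# Hypothetical posterior draws: S simulations of J = 5 coefficients.
S = 1000
centers = (-1.0, -0.5, 0.0, 0.5, 1.0)
draws = [[random.gauss(c, 0.3) for c in centers] for _ in range(S)]

# One value of s per simulation draw, then a posterior summary.
s_draws = sorted(finite_population_sd(alpha) for alpha in draws)
summary = {p: quantile(s_draws, p) for p in (0.025, 0.25, 0.5, 0.75, 0.975)}
print(summary)
```

The median and the 50% and 95% intervals in `summary` correspond to the kind of display used for the finite-population standard deviations; the constrained case would replace `finite_population_sd` with the projection described above.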

Figure 15.6 shows posterior inferences for the finite-population standard deviation
parameters $s_\alpha$ for each row of the latin square split-plot Anova, showing inferences given the
uniform and hierarchical half-Cauchy prior distributions for the variance parameters $\sigma_\alpha$.
The half-Cauchy prior distribution does slightly better than the uniform, with the largest 
shrinkage occurring for the variance component that has just one degree of freedom. The 
Cauchy scale parameter A was estimated at 1.8, with a 95% posterior interval of [0.5, 5.1]. 


15.8 Bibliographic note 


Gelman and Hill (2007) present a thorough elementary level introduction to statistical 
modeling with hierarchical linear models. Novick et al. (1972) describe an early application 
of Bayesian hierarchical regression. Lindley and Smith (1972) present the general form for 
the normal linear model (using a slightly different notation than ours); see also Hodges 
(1998). Many interesting applications of Bayesian hierarchical regression have appeared in 
the statistical literature since then; for example, Fearn (1975) analyzes growth curves, Hui 
and Berger (1983) and Strenio, Weisberg, and Bryk (1983) estimate patterns in longitudinal 
data, and Normand, Glickman, and Gatsonis (1997) analyze death rates in a set of hospitals. 
The business school prediction example at the end of Section 15.4 is taken from Braun et 
al. (1983), who perform the approximate Bayesian inference described in the text. Rubin 
(1980b) presents a hierarchical linear regression in an educational example and goes into 
some detail on the advantages of the hierarchical approach. Other references on hierarchical
linear models appear at the end of Chapter 5. 

Random-effects or varying-coefficients regression has a long history in the non-Bayesian 
statistical literature; for example, see Henderson et al. (1959). Robinson (1991) provides 
a review, using the term ‘best linear unbiased prediction’.1 The Bayesian approach differs 
by averaging over uncertainty in the posterior distribution of the hierarchical parameters, 
which is important in problems such as the educational testing example of Section 5.5 with 
large posterior uncertainty in the hierarchical variance parameter. 

Prior distributions and Bayesian inference for the covariance matrix of a multivariate 
normal distribution are discussed in Leonard and Hsu (1992), Yang and Berger (1994), 
Daniels and Kass (1999, 2001), and Barnard, McCulloch, and Meng (2000). Each of the 
above works on a different parameterization of the covariance matrix. Wong, Carter, and 
Kohn (2002) discuss prior distributions for the inverse covariance matrix. Verbeke and 
Molenberghs (2000) and Daniels and Pourahmadi (2002) discuss hierarchical linear models 
for longitudinal data. 

Tokuda et al. (2011) present some methods for visualizing prior distributions for covari- 
ance matrices. 

Hierarchical linear modeling has recently gained in popularity, especially in the social 
sciences, where it is often called multilevel modeling. An excellent summary of these appli- 
cations at a fairly elementary level is provided by Raudenbush and Bryk (2002). Other texts 
in this area include Kreft and DeLeeuw (1998) and Snijders and Bosker (1999). Leyland 
and Goldstein (2001) provide an overview of multilevel models for public health research. 
Cressie et al. (2009) discuss hierarchical models in ecology. 

Other key references on multilevel models in social science are Goldstein (1995), Longford 
(1993), and Aitkin and Longford (1986); the latter article is an extended discussion of the 
practical implications of undertaking a detailed hierarchical modeling approach to contro- 
versial issues in school effectiveness studies in the United Kingdom. Sampson, Raudenbush, 
and Earls (1997) discuss a study of crime using a hierarchical model of city neighborhoods. 

Gelman and King (1993) discuss the presidential election forecasting problem in more 
detail, with references to earlier work in the econometrics and political science literature. 
Much has been written on election forecasting; see, for example, Rosenstone (1984) and
Hibbs (2008). Boscardin and Gelman (1996) provide details on computations, inference, 
and model checking for the model described in Section 15.2 and some extensions. 

Gelfand, Sahu, and Carlin (1995) discuss linear transformations for Gibbs samplers in
hierarchical regressions; Liu and Wu (1999) and Gelman et al. (2008) discuss the
parameter-expanded Gibbs sampler for hierarchical linear and generalized linear models. Pinheiro
and Bates present an approach to computing hierarchical models by integrating over the 
linear parameters. 

Much has been written on Bayesian methods for estimating many regression coefficients, 
almost all from the perspective of treating all the coefficients in a problem as exchangeable. 
Ridge regression (Hoerl and Kennard, 1970) is a procedure equivalent to an exchangeable 
normal prior distribution on the coefficients, as has been noted by Goldstein (1976), Wahba 
(1978), and others. Leamer (1978a) discusses the implicit models corresponding to stepwise 
regression and some other methods. George and McCulloch (1993) propose an exchangeable 
bimodal prior distribution for regression coefficients. Madigan and Raftery (1994) propose 
an approximate Bayesian approach for averaging over a distribution of potential regression
models. Clyde, DeSimone, and Parmigiani (1996) and West (2003) present Bayesian
methods using linear transformations for averaging over large numbers of potential
predictors. Chipman, Kolaczyk, and McCulloch (1997) consider Bayesian models for wavelet
decompositions.

Footnote 1: Posterior means of regression coefficients and ‘random effects’ from hierarchical models are biased
‘estimates’ but can be unbiased or approximately unbiased when viewed as ‘predictions,’ since conventional
frequency evaluations condition on all unknown ‘parameters’ but not on unknown ‘predictive quantities’;
the latter distinction has no meaning within a Bayesian framework. (Recall the example on page 94 of
estimating daughters’ heights from mothers’ heights.)

The perspective on analysis of variance given here is from Gelman (2005); previous 
work along similar lines includes Plackett (1960), Yates (1967), Nelder (1977, 1994),
and Hodges and Sargent (2001). Volfovsky and Hoff (2012) propose a class of structured
models for hierarchical regression parameters, going beyond the simple model of coefficients 
exchangeable in batches. 


15.9 Exercises 


1. Varying-coefficients models: express the educational testing example of Section 5.5 as a 
hierarchical linear model with eight observations and known observation variances. Draw 
simulations from the posterior distribution using the methods described in this chapter. 


2. Fitting a hierarchical model for a two-way array: 


(a) Fit a standard analysis of variance model to the randomized block data discussed in 
Exercise 8.5, that is, a linear regression with a constant term, indicators for all but 
one of the blocks, and all but one of the treatments. 


(b) Summarize posterior inference for the (superpopulation) average penicillin yields,
averaging over the block conditions, under each of the four treatments. Under this measure,
what is the probability that each of the treatments is best? Give a 95% posterior 
interval for the difference in yield between the best and worst treatments. 


(c) Set up a hierarchical extension of the model, in which you have indicators for all five 
blocks and all five treatments, and the block and treatment indicators are two sets 
of varying coefficients. Explain why the means for the block and treatment indicator 
groups should be fixed at zero. Write the joint distribution of all model parameters 
(including the hierarchical parameters). 


(d) Compute the posterior mode of the three variance components of your model in (c) 
using EM. Construct a normal approximation about the mode and use this to obtain 
posterior inferences for all parameters and answer the questions in (b). (Hint: you 
can use the general regression framework or extend the procedure in Section 13.6.) 


(e) Check the fit of your model to the data. Discuss the relevance of the randomized 
block design to your check; how would the posterior predictive simulations change if 
you were told that the treatments had been assigned by complete randomization? 


(f) Obtain draws from the actual posterior distribution using the Gibbs sampler, using 
your results from (d) to obtain starting points. Run multiple sequences and monitor 
the convergence of the simulations by computing $\hat{R}$ for all parameters in the model.


(g) Discuss how your inferences in (b), (d), and (e) differ. 


3. Regression with many explanatory variables: Table 15.2 displays data from a designed 
experiment for a chemical process. In using these data to illustrate various approaches 
to selection and estimation of regression coefficients, Marquardt and Snee (1975) as- 
sume a quadratic regression form; that is, a linear relation between the expectation of 
the untransformed outcome, y, and the variables $x_1, x_2, x_3$, their two-way interactions,
$x_1 x_2, x_1 x_3, x_2 x_3$, and their squares, $x_1^2, x_2^2, x_3^2$.

(a) Fit an ordinary linear regression model (that is, nonhierarchical with a uniform prior 
distribution on the coefficients), including a constant term and the nine explanatory 
variables above. 


(b) Fit a mixed-effects linear regression model with a uniform prior distribution on the
constant term and a shared normal prior distribution on the coefficients of the nine
variables above. If you use iterative simulation in your computations, be sure to use
multiple sequences and monitor their joint convergence.

Reactor        Ratio of H2          Contact        Conversion of
temperature    to n-heptane         time           n-heptane to
(°C), $x_1$    (mole ratio), $x_2$  (sec), $x_3$   acetylene (%), y
1300            7.5                 0.0120         49.0
1300            9.0                 0.0120         50.2
1300           11.0                 0.0115         50.5
1300           13.5                 0.0130         48.5
1300           17.0                 0.0135         47.5
1300           23.0                 0.0120         44.5
1200            5.3                 0.0400         28.0
1200            7.5                 0.0380         31.5
1200           11.0                 0.0320         34.5
1200           13.5                 0.0260         35.0
1200           17.0                 0.0340         38.0
1200           23.0                 0.0410         38.5
1100            5.3                 0.0840         15.0
1100            7.5                 0.0980         17.0
1100           11.0                 0.0920         20.5
1100           17.0                 0.0860         19.5

Table 15.2 Data from a chemical experiment, from Marquardt and Snee (1975). The first three
variables are experimental manipulations, and the fourth is the outcome measurement.

(c) Discuss the differences between the inferences in (a) and (b). Interpret the differences 
in terms of the hierarchical variance parameter. Do you agree with Marquardt and 
Snee that the inferences from (a) are unacceptable? 

(d) Repeat (a), but with a $t_4$ prior distribution on the nine variables.

(e) Discuss other models for the regression coefficients. 

4. Analysis of variance: 


(a) Create an analysis of variance plot for the educational testing example in Chapter 
5, assuming that there were exactly 60 students in the study in each school, with 30 
receiving the treatment and 30 receiving the control. 

(b) Discuss the relevance of the finite-population and superpopulation standard deviation 
for each source of variation. 


5. Modeling correlation matrices: 


(a) Show that the determinant of a correlation matrix R is a quadratic function of any
of its elements. (This fact can be used in setting up a Gibbs sampler for multivariate
models.)

(b) Suppose that the off-diagonal elements of a 3 × 3 correlation matrix are 0.4, 0.8, and
r. Determine the range of possible values of r.

(c) Suppose all the off-diagonal elements of a d-dimensional correlation matrix R are equal
to the same value, r. Prove that R is positive definite if and only if $-1/(d-1) < r < 1$.


6. Analysis of a two-way stratified sample survey: Section 8.3 and Exercise 11.7 present an 
analysis of a stratified sample survey using a hierarchical model on the stratum proba- 
bilities. That analysis is not fully appropriate because it ignores the two-way structure 
of the stratification, treating the 16 strata as exchangeable. 


(a) Set up a linear model for $\text{logit}(\phi)$ with three groups of varying coefficients, for the four
regions, the four place sizes, and the 16 strata.

(b) Simplify the model by assuming that the $\phi_{1j}$’s are independent of the $\phi_{2j}$’s. This
separates the problem into two generalized linear models, one estimating Bush vs. Dukakis
preferences, the other estimating ‘no opinion’ preferences. Perform the computations
for this model to yield posterior simulations for all parameters.

(c) Expand to a multivariate model by allowing the $\phi_{1j}$’s and $\phi_{2j}$’s to be correlated.
Perform the computations under this model, using the results from Exercise 11.7 and
part (b) above to construct starting distributions.

(d) Compare your results to those from the simpler model treating the 16 strata as
exchangeable.

Chapter 16


Generalized linear models 


This chapter reviews generalized linear models from a Bayesian perspective. We discuss 
prior distributions and hierarchical models. We demonstrate how to approximate the like- 
lihood by a normal distribution, as an approximation and as a step in more general compu- 
tations. Finally, we discuss the class of loglinear models, a subclass of Poisson generalized 
linear models that is commonly used for missing data imputation and discrete multivariate 
outcomes. This chapter is not intended to be exhaustive, but rather to provide enough guid- 
ance so that the reader can combine generalized linear models with the ideas of hierarchical 
models, posterior simulation, prediction, model checking, and sensitivity analysis that we 
have already presented for Bayesian methods in general, and linear models in particular. 

As we have seen in Chapters 14 and 15, a stochastic model based on a linear predictor 
$X\beta$ is easy to understand and can work in a variety of problems, especially if we are careful
about transforming and appropriately interpreting the regression coefficients. The purpose 
of the generalized linear model is to extend the idea of linear modeling to cases for which the 
linear relationship between X and E(y|X) or the normal distribution is not appropriate. 

In some cases, it is reasonable to apply a linear model to a suitably transformed outcome 
using transformed (or untransformed) explanatory variables. For example, in the election 
forecasting example of Chapter 15, the outcome—the incumbent party candidate’s share of 
the vote in a state election—must lie between 0 and 1, so a linear model does not make 
logical sense: it is possible for a combination of explanatory variables, or the variation term, 
to be so extreme that y would exceed its bounds. In that example, however, the boundaries 
are not a serious problem, since the observations are almost all between 0.2 and 0.8, and 
the residual standard deviation is about 0.06. Another case in which a linear model can be 
improved by transformation occurs when the relation between X and y is multiplicative: for 
example, if $y_i = x_{i1}^{b_1} x_{i2}^{b_2} \cdots x_{ik}^{b_k} \cdot \text{variation}$, then $\log y_i = b_1 \log x_{i1} + \cdots + b_k \log x_{ik} + \text{variation}$,
and a linear model relating $\log y_i$ to $\log x_{ij}$ is appropriate.

However, the relation between X and E(y|X) cannot always be usefully modeled as 
normal and linear, even after transformation of the data. For example, suppose that y 
cannot be negative, but might be zero. Then we cannot just analyze $\log y$, even if the
relation of E(y) to X is generally multiplicative. If y is discrete-valued (for example, the 
number of occurrences of a rare disease by county) then the mean of y may be linearly 
related to X, but the variation term cannot be described by the normal distribution. 

The class of generalized linear models unifies the approaches needed to analyze data 
for which either the assumption of a linear relation between x and y or the assumption of 
normal variation is not appropriate. A generalized linear model is specified in three stages: 


1. The linear predictor, $\eta = X\beta$,

2. The link function $g(\cdot)$ that relates the linear predictor to the mean of the outcome
variable: $\mu = g^{-1}(\eta) = g^{-1}(X\beta)$,

3. The random component specifying the distribution of the outcome variable y with mean
$E(y|X) = \mu$. The distribution can also depend on a dispersion parameter, $\phi$.

Thus, the mean of the distribution of y, given X, is determined by $X\beta$: $E(y|X) = g^{-1}(X\beta)$.
We use the same notation as in linear regression whenever possible, so that X is the n × p
matrix of explanatory variables and $\eta = X\beta$ is the vector of n linear predictor values. If we
denote the linear predictor for the ith case by $X_i\beta$ and the variance or dispersion parameter
(if present) by $\phi$, then the data distribution takes the form

$$p(y|X, \beta, \phi) = \prod_{i=1}^n p(y_i | X_i\beta, \phi). \qquad (16.1)$$

The most commonly used generalized linear models, for the Poisson and binomial
distributions, do not require a dispersion parameter; that is, $\phi$ is fixed at 1. In practice, however,
excess dispersion is the rule rather than the exception in most applications.
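
The three stages and the product form (16.1) can be sketched in a few lines of code. This is only an illustration under stated assumptions: the function names, the small design matrix, the coefficients, and the use of the normal model (with the dispersion parameter playing the role of the standard deviation) are all made up for the example.

```python
import math

def linear_predictor(X, beta):
    """Stage 1: eta_i = X_i beta for each row X_i of the matrix X."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]

# Stage 2: inverse link functions g^{-1} that map eta to the mean mu.
INVERSE_LINKS = {
    "identity": lambda eta: eta,
    "log": math.exp,
    "logit": lambda eta: 1.0 / (1.0 + math.exp(-eta)),
}

def normal_log_density(y_i, mu_i, phi):
    """Stage 3 for the normal model, with phi used as the standard deviation."""
    return -0.5 * math.log(2.0 * math.pi * phi ** 2) - 0.5 * ((y_i - mu_i) / phi) ** 2

def log_likelihood(y, X, beta, phi, link, log_density):
    """Log of (16.1): the sum over cases of log p(y_i | X_i beta, phi)."""
    inverse_link = INVERSE_LINKS[link]
    means = [inverse_link(eta) for eta in linear_predictor(X, beta)]
    return sum(log_density(y_i, mu_i, phi) for y_i, mu_i in zip(y, means))

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]   # constant term plus one predictor
beta = [0.5, 0.2]                          # made-up coefficients
y = [0.4, 0.9, 0.8]                        # made-up outcomes
print(linear_predictor(X, beta))
print(log_likelihood(y, X, beta, phi=0.3, link="identity",
                     log_density=normal_log_density))
```

Swapping the `link` argument and the `log_density` function changes the model family without touching the linear predictor, which is the sense in which the three stages are separate choices.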


16.1 Standard generalized linear model likelihoods 
Continuous data 


Linear regression is a special case of the generalized linear model, with the identity link 
function, $g(\mu) = \mu$. For continuous data that are all positive, we can use the normal model
on the logarithmic scale. When that distributional family does not fit the data, the gamma 
and Weibull distributions are sometimes considered as alternatives. 


Poisson 


Counted data are often modeled using a Poisson model. The Poisson generalized linear
model, often called the Poisson regression model, assumes that y is Poisson with mean $\mu$
(and therefore variance $\mu$). The link function is typically chosen to be the logarithm, so
that $\log \mu = X\beta$. The distribution for data $y = (y_1, \ldots, y_n)$ is thus

$$p(y|\beta) = \prod_{i=1}^n \frac{1}{y_i!} e^{-\exp(\eta_i)} (\exp(\eta_i))^{y_i}, \qquad (16.2)$$

where $\eta_i = X_i\beta$ is the linear predictor for the i-th case. When considering the Bayesian
posterior distribution, we condition on y, and so the factors of $1/y_i!$ can be absorbed into an
arbitrary constant. We consider an example of a Poisson regression in Section 16.4 below.
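
To make the remark about the $1/y_i!$ factors concrete, the following sketch evaluates the logarithm of the Poisson likelihood with and without those factors and confirms that differences between two values of the coefficient vector are unaffected. The data, the coefficient values, and the function name are invented for illustration.

```python
import math

def poisson_log_likelihood(beta, X, y, include_constant=True):
    """Log of the Poisson likelihood with log link:
    sum_i [y_i * eta_i - exp(eta_i) - log(y_i!)], where eta_i = X_i beta."""
    total = 0.0
    for row, y_i in zip(X, y):
        eta_i = sum(x * b for x, b in zip(row, beta))
        total += y_i * eta_i - math.exp(eta_i)
        if include_constant:
            total -= math.lgamma(y_i + 1)  # log(y_i!)
    return total

X = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # made-up predictors
y = [0, 2, 3, 7]                                       # made-up counts
beta_a, beta_b = [0.5, 0.6], [0.3, 0.8]

# Differences in log-likelihood between two values of beta do not involve
# the factorial terms, so they can be dropped when working with the posterior.
diff_full = poisson_log_likelihood(beta_a, X, y) - poisson_log_likelihood(beta_b, X, y)
diff_kernel = (poisson_log_likelihood(beta_a, X, y, include_constant=False)
               - poisson_log_likelihood(beta_b, X, y, include_constant=False))
print(diff_full, diff_kernel)
```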


Binomial 


Perhaps the most widely used of the generalized linear models are those for binary or
binomial data. Suppose that $y_i \sim \text{Bin}(n_i, \mu_i)$ with $n_i$ known. It is common to specify the
model in terms of the mean of the proportions $y_i/n_i$, rather than the mean of $y_i$. Choosing
the logit transformation of the probability of success, $g(\mu_i) = \log(\mu_i/(1 - \mu_i))$, as the link
function leads to the logistic regression model. The distribution for data y is

$$p(y|\beta) = \prod_{i=1}^n \binom{n_i}{y_i} \left(\frac{e^{\eta_i}}{1 + e^{\eta_i}}\right)^{y_i} \left(\frac{1}{1 + e^{\eta_i}}\right)^{n_i - y_i}.$$

We have already considered logistic regressions for a bioassay experiment in Section 3.7 and 
a public health survey in Section 6.3, and we present another example in Section 16.5. 

Other link functions are often used; for example, the probit link, $g(\mu) = \Phi^{-1}(\mu)$, is
commonly used in econometrics. The data distribution for the probit model is

$$p(y|\beta) = \prod_{i=1}^n \binom{n_i}{y_i} (\Phi(\eta_i))^{y_i} (1 - \Phi(\eta_i))^{n_i - y_i}.$$

The probit link is obtained by retaining the normal variation process in the normal linear
model while assuming that all outcomes are dichotomized. In practice, the probit and logit
models are similar, differing mainly in the extremes of the tails. In either case, the factors of
$\binom{n_i}{y_i}$ depend only on observed quantities and can be subsumed into a constant factor in the
posterior density. The logit and probit models can also be generalized to model multivariate
outcomes, as we discuss in Section 16.6.
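
The similarity of the two models can be checked numerically. The sketch below defines both inverse links, a binomial log-likelihood kernel that omits the binomial coefficients, and a comparison of the two curves after rescaling. The rescaling factor 1.6 is a common rule of thumb and is an assumption here, not something taken from the text; the function names are likewise illustrative.

```python
import math
from statistics import NormalDist

def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

def probit_mean(eta):
    """Inverse of the probit link: the standard normal cdf."""
    return NormalDist().cdf(eta)

def binomial_log_kernel(y, n, eta, inverse_link):
    """Sum of y_i*log(mu_i) + (n_i - y_i)*log(1 - mu_i), omitting log C(n_i, y_i)."""
    total = 0.0
    for y_i, n_i, eta_i in zip(y, n, eta):
        mu_i = inverse_link(eta_i)
        total += y_i * math.log(mu_i) + (n_i - y_i) * math.log(1.0 - mu_i)
    return total

# Compare Phi(z) with inverse-logit(1.6 z); the gap stays small over this range
# and is largest, in relative terms, far out in the tails.
grid = [i / 10.0 for i in range(-30, 31)]
max_gap = max(abs(probit_mean(z) - inv_logit(1.6 * z)) for z in grid)
print(max_gap)
```

Either inverse link can be passed to `binomial_log_kernel`, so the same code serves for logistic and probit regression once the linear predictor values are available.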

The t distribution can be used in a robust alternative to the logit and probit models, 
as discussed in Section 17.2. Another standard link function is $g(\mu) = \log(-\log(\mu))$, the
complementary log-log, which is asymmetrical in $\mu$ (that is, $g(\mu) \neq -g(1 - \mu)$).


Overdispersed models 


Classical analyses of generalized linear models allow for the possibility of variation beyond 
that of the assumed sampling distribution, often called overdispersion. For an example, 
consider a logistic regression in which the sampling unit is a litter of mice and the proportion 
of the litter born alive is considered binomial with some explanatory variables (such as 
mother’s age, drug dose, and so forth). The data might indicate more variation than 
expected under the binomial distribution due to systematic differences among mothers. 
Such variation could be incorporated in a hierarchical model using an indicator for each 
mother, with these indicators themselves following a distribution such as normal (which 
is often easy to interpret) or beta (which is the conjugate prior distribution for the binomial).
Section 16.4 gives an example of a Poisson regression model in which overdispersion is 
modeled by including a normal error term for each data point. 
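
A small simulation shows what a normal error term on the linear predictor does to Poisson data. This is a sketch under made-up settings (a common mean of 4 and an error standard deviation of 0.5); the Poisson sampler is a simple textbook method written out because the Python standard library does not provide one.

```python
import math
import random
import statistics

random.seed(2)

def draw_poisson(mean):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit = math.exp(-mean)
    count, product = 0, random.random()
    while product > limit:
        count += 1
        product *= random.random()
    return count

def dispersion_ratio(sample):
    """Sample variance divided by sample mean; close to 1 for Poisson data."""
    return statistics.variance(sample) / statistics.fmean(sample)

eta = math.log(4.0)   # common linear predictor, so exp(eta) = 4
tau = 0.5             # illustrative sd of the normal error term
n_draws = 20000

plain = [draw_poisson(math.exp(eta)) for _ in range(n_draws)]
overdispersed = [draw_poisson(math.exp(eta + random.gauss(0.0, tau)))
                 for _ in range(n_draws)]

print(dispersion_ratio(plain), dispersion_ratio(overdispersed))
```

The first ratio should be near 1, as the Poisson model requires, while the second is clearly larger, which is the pattern referred to as overdispersion.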


16.2 Working with generalized linear models 
Canonical link functions 


The description of the standard models in the previous section used what are known as 
the canonical link functions for each family. The canonical link is the function of the mean 
parameter that appears in the exponent of the exponential family form of the probability 
density (see page 36). We often use the canonical link, but nothing in our discussion is 
predicated on this choice; for example, the probit link for the binomial and the cumulative 
multinomial models (see Section 16.6) is not canonical. 


Offsets 


It is sometimes convenient to express a generalized linear model so that one of the
predictors has a known coefficient. A predictor of this type is called an offset and commonly
arises in Poisson models where the rate of occurrence is $\mu$ per unit of time, so that with
exposure T the expected number of incidents is $\mu T$. We might like to take $\log \mu = X\beta$ as
in the usual Poisson generalized linear model (16.2); however, generalized linear models are
parameterized through the mean of y, which is $\mu T$, where T now represents the vector of
exposure times for the units in the regression. We can apply the Poisson generalized linear 
model by augmenting the matrix of explanatory variables with a column containing the 
values log T; this column of the augmented matrix corresponds to a coefficient with known 
value (equal to 1). An example appears in the Poisson regression of police stops in Section 
16.4, where we use a measure of previous crime as an offset. 
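
The offset construction can be sketched as follows, assuming a Poisson model with log link. The two helper functions, the coefficients, and the exposure values are hypothetical; the point is only that adding log T to the linear predictor is the same as appending a column log T whose coefficient is fixed at 1.

```python
import math

def poisson_mean_with_offset(row, beta, exposure):
    """Expected count mu*T, where log(mu) = X_i beta and log(T) enters with coefficient 1."""
    eta = sum(x * b for x, b in zip(row, beta)) + math.log(exposure)
    return math.exp(eta)

def augment_with_offset(X, exposures):
    """Append log(T) as an extra column; its coefficient is known to equal 1."""
    return [row + [math.log(t)] for row, t in zip(X, exposures)]

X = [[1.0, 0.0], [1.0, 1.0]]
beta = [math.log(2.0), 0.5]      # made-up coefficients: rate 2 per unit time at x = 0
exposures = [1.0, 10.0]          # made-up exposure times T

X_aug = augment_with_offset(X, exposures)
beta_aug = beta + [1.0]          # known coefficient for the offset column
for row, row_aug, t in zip(X, X_aug, exposures):
    direct = poisson_mean_with_offset(row, beta, t)
    via_augmented = math.exp(sum(x * b for x, b in zip(row_aug, beta_aug)))
    print(direct, via_augmented)
```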


Interpreting the model parameters 


The choice and parameterization of the explanatory variables x involve the same
considerations as already discussed for linear models. The warnings about interpreting the regression
coefficients from Chapter 14 apply here with one important addition. The linear predictor
is used to predict the link function $g(\mu)$, rather than $\mu = E(y)$, and therefore the effect of
changing the jth explanatory variable $x_j$ by a fixed amount depends on the current value
of x. One way to translate effects onto the scale of y is to measure changes compared to
a standard case with vector of predictors $x_0$ and standard outcome $y_0 = g^{-1}(x_0\beta)$. Then,
adding or subtracting the vector $\Delta x$ leads to a change in the standard outcome from $y_0$
to $g^{-1}(g(y_0) + (\Delta x)\beta)$. This expression can also be written in differential form, but it is
generally more useful to consider changes $\Delta x$ that are not necessarily close to zero.
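
As a hedged illustration of this translation, the sketch below applies the same change in a predictor to three different standard outcomes under a logit link. The coefficient values and function names are assumptions made for the example.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def shifted_outcome(y0, delta_x, beta, link=logit, inverse_link=inv_logit):
    """Return g^{-1}(g(y0) + (delta_x) beta) for a change delta_x in the predictors."""
    shift = sum(d * b for d, b in zip(delta_x, beta))
    return inverse_link(link(y0) + shift)

beta = [0.0, 1.0]          # made-up coefficients (constant term, one predictor)
delta_x = [0.0, 1.0]       # increase the predictor by one unit
changes = {y0: shifted_outcome(y0, delta_x, beta) - y0 for y0 in (0.5, 0.9, 0.99)}
print(changes)
```

The same one-unit change moves a standard outcome of 0.5 much more than it moves an outcome already near 1, which is why the effect must be reported relative to a chosen standard case.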


Understanding discrete-data models in terms of latent continuous data 


An important idea, both in understanding and computing discrete-data regressions, is a
reexpression in terms of unobserved (latent) continuous data. The probit model for binary
data, $\Pr(y_i = 1) = \Phi(X_i\beta)$, is equivalent to the following model on latent data $u_i$:

$$\begin{aligned} u_i &\sim N(X_i\beta, 1) \\ y_i &= \begin{cases} 1 & \text{if } u_i > 0 \\ 0 & \text{if } u_i < 0. \end{cases} \end{aligned} \qquad (16.3)$$


Sometimes the latent data can be given a useful interpretation. For example, in a political 
survey, if y = 0 or 1 represents support for the Democrats or the Republicans, then u can 
be seen as a continuous measure of partisan preference. 

Another advantage of the latent parameterization for the probit model is that it allows
a convenient Gibbs sampler computation. Conditional on the latent $u_i$’s, the model is a
simple linear regression (the first line of (16.3)). Then, conditional on the model parameters
and the data, the $u_i$’s have truncated normal distributions, with each $u_i \sim N(X_i\beta, 1)$,
truncated either to be negative or positive, depending on whether $y_i = 0$ or 1.
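
The two conditional steps can be sketched for the simplest possible case: a single predictor with scalar coefficient, no constant term, and a flat prior distribution on the coefficient. These restrictions, the simulated data, and the function names are assumptions made to keep the example short; the truncated normal draws use the inverse-cdf method, which is adequate here but loses accuracy far in the tails.

```python
import random
from statistics import NormalDist

random.seed(3)
STD_NORMAL = NormalDist()

def draw_latent(mean, y_i):
    """Draw u_i ~ N(mean, 1) truncated to u_i > 0 if y_i == 1, else to u_i < 0."""
    p_negative = STD_NORMAL.cdf(-mean)            # Pr(u_i < 0) before truncation
    if y_i == 1:
        p = p_negative + random.random() * (1.0 - p_negative)
    else:
        p = random.random() * p_negative
    p = min(max(p, 1e-12), 1.0 - 1e-12)           # keep inv_cdf inside its domain
    return mean + STD_NORMAL.inv_cdf(p)

def probit_gibbs(x, y, n_iter=1500, beta=0.0):
    """Gibbs sampler for Pr(y_i = 1) = Phi(x_i * beta), flat prior on scalar beta."""
    sum_x2 = sum(x_i * x_i for x_i in x)
    draws = []
    for _ in range(n_iter):
        # Step 1: latent data given beta and y (truncated normal draws).
        u = [draw_latent(x_i * beta, y_i) for x_i, y_i in zip(x, y)]
        # Step 2: beta given the latent data (normal regression with variance 1).
        beta_hat = sum(x_i * u_i for x_i, u_i in zip(x, u)) / sum_x2
        beta = random.gauss(beta_hat, (1.0 / sum_x2) ** 0.5)
        draws.append(beta)
    return draws

# Simulated data with a made-up true coefficient of 1.0.
x = [random.uniform(-2.0, 2.0) for _ in range(200)]
y = [1 if random.gauss(x_i * 1.0, 1.0) > 0 else 0 for x_i in x]
draws = probit_gibbs(x, y)
kept = draws[500:]                                 # discard early iterations
posterior_mean = sum(kept) / len(kept)
print(posterior_mean)
```

With several predictors, step 2 becomes a draw from a multivariate normal distribution centered at the least-squares estimate from regressing u on X, and in practice multiple sequences would be run and monitored for convergence.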

The latent-data idea can also be applied to logistic regression, in which case the first
line of (16.3) becomes $u_i \sim \text{logistic}(X_i\beta, 1)$, where the logistic distribution has a density
of $1/(e^{x/2} + e^{-x/2})^2$, that is, the derivative of the inverse-logit function. This model is less
convenient for computation than the probit (since the regression with logistic errors is not
particularly easy to compute) but can still be useful for model understanding.

Similar interpretations can be given to ordered multinomial regressions of the sort 
described in Section 16.6 below. For example, if the data $y$ can take on the values 0, 1, 2, 
or 3, then an ordered multinomial model can be defined in terms of cutpoints $c_0, c_1, c_2$, so 
that the response $y_i$ equals 0 if $u_i < c_0$, 1 if $u_i \in (c_0, c_1)$, 2 if $u_i \in (c_1, c_2)$, and 3 if $u_i > c_2$. 
There is more than one natural way to parameterize these cutpoint models, and the choice 
of parameterization has implications when the model is placed in a hierarchical structure. 

For example, the likelihood for a generalized linear model under this parameterization 
is unchanged if a constant is added to all three cutpoints $c_0, c_1, c_2$, and so the model is 
potentially nonidentifiable. The typical way to handle this is to set one of the cutpoints at 
a fixed value, for example setting $c_0 = 0$, so that only $c_1$ and $c_2$ need to be estimated from 
the data. (This generalizes the latent-data interpretation of binary data given above, for 
which 0 is the preset cutpoint.) 
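
The cutpoint rule, and the reason a constant shift is not identified, can be illustrated with a short sketch. The function name and the numerical values are hypothetical.

```python
from bisect import bisect_right

def ordered_category(u, cutpoints):
    """Map a latent value u to a category 0..len(cutpoints), given increasing cutpoints."""
    return bisect_right(cutpoints, u)

cutpoints = [0.0, 1.0, 2.0]            # c_0 fixed at 0; c_1 and c_2 hypothetical
latent = [-0.5, 0.5, 1.5, 2.5]
print([ordered_category(u, cutpoints) for u in latent])

# Adding the same constant to the latent values (for example, through the
# intercept) and to every cutpoint leaves all categories unchanged.
shift = 3.7
print([ordered_category(u + shift, [c + shift for c in cutpoints]) for u in latent])
```

Both printed lists are identical, which is the nonidentifiability that fixing $c_0 = 0$ removes.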


Bayesian nonhierarchical and hierarchical generalized linear models 


We consider generalized linear models with noninformative prior distributions on $\beta$, 
informative prior distributions on $\beta$, and hierarchical models for which the prior distribution on 
$\beta$ depends on unknown hyperparameters. We attempt to treat all generalized linear models 
with the same broad brush, which causes some difficulties. For example, some generalized 
linear models are expressed with a dispersion parameter in addition to the regression 
coefficients ($\sigma$ in the normal case); here, we use the general notation $\phi$ for a dispersion parameter 
or parameters. A prior distribution can be placed on the dispersion parameter, and any 
prior information about $\beta$ can be described conditional on the dispersion parameter; that 
is, $p(\beta, \phi) = p(\phi)p(\beta \mid \phi)$. In the description that follows we focus on the model for $\beta$. We 
defer computational details until the next section. 


Noninformative prior distributions on $\beta$ 


The classical analysis of generalized linear models is obtained if a noninformative or flat 
prior distribution is assumed for $\beta$. The posterior mode corresponding to a noninformative 
uniform prior density is the maximum likelihood estimate for the parameter $\beta$, which can 
be obtained using iterative weighted linear regression (as implemented in the computer 
packages R or Glim, for example). Approximate posterior inference can be obtained from 
a normal approximation to the likelihood. 


Conjugate prior distributions 


A sometimes helpful approach to specifying prior information about $\beta$ is in terms of 
hypothetical data obtained under the same model, that is, a vector, $y_0$, of $n_0$ hypothetical 
data points and a corresponding $n_0 \times k$ matrix of explanatory variables, $X_0$. As in Section 
14.8, the resulting posterior distribution is identical to that from an augmented data vector 
$\begin{pmatrix} y \\ y_0 \end{pmatrix}$ (that is, $y$ and $y_0$ strung together as a vector, not the combinatorial coefficient) with 
matrix of explanatory variables $\begin{pmatrix} X \\ X_0 \end{pmatrix}$ and a noninformative uniform prior density on $\beta$. For 
computation with conjugate prior distributions, one can thus use the same iterative methods 
as for noninformative prior distributions. 


Nonconjugate prior distributions 


It is often more natural to express prior information directly in terms of the parameters $\beta$. 
For example, we might use the normal model, $\beta \sim \mathrm{N}(\beta_0, \Sigma_\beta)$ with specified values of $\beta_0$ 
and $\Sigma_\beta$. A normal prior distribution on $\beta$ is particularly convenient with the computational 
methods we describe in the next section, which are based on a normal approximation to 
the likelihood. 


Hierarchical models 


As in the normal linear model, hierarchical prior distributions for generalized linear models 
are a natural way to fit complex data structures and allow us to include more explanatory 
variables without encountering the problems of ‘overfitting.’ 

A normal distribution for $\beta$ is commonly used so that one can mimic the modeling 
practices and computational methods already developed for the hierarchical normal linear 
model, using the normal approximation to the generalized linear model likelihood described 
in the next section. 


Normal approximation to the likelihood 


Posterior inference in generalized linear models typically requires the approximation and 
sampling tools of Part III. For example, for the hierarchical model for the rat tumor example 
in Section 5.1 the only reason we could compute the posterior distribution on a grid was 
that we assumed a conjugate beta prior distribution for the tumor rates. This trick would 
not be so effective, however, if we had linear predictors for the 70 experiments—at that 
point, it would be more appealing to set up a hierarchical logistic regression, for which 
exact computations are impossible. 

Pseudodata and pseudovariances. For simple computations it can be convenient to 
approximate the likelihood by a normal distribution in $\beta$, conditional on the dispersion parameter 
$\phi$, if necessary, and any hierarchical parameters. The basic method is to approximate the 
generalized linear model by a linear model; for each data point $y_i$, we construct a 
'pseudodatum' $z_i$ and a 'pseudovariance' $\sigma_i^2$ so that the generalized linear model likelihood, 
$p(y_i \mid X_i\beta, \phi)$, is approximated by the normal likelihood, $\mathrm{N}(z_i \mid X_i\beta, \sigma_i^2)$. We can then 
combine the $n$ pseudodata points and approximate the entire likelihood by a linear regression 
model of the vector $z = (z_1, \ldots, z_n)$ on the matrix of explanatory variables $X$, with known 
variance matrix, $\mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$. This somewhat convoluted approach has the advantage 
of producing an approximate likelihood that we can analyze as if it came from normal linear 
regression data, thereby allowing the use of available linear regression algorithms. 

Center of the normal approximation. In general, the normal approximation will depend on 
the value of $\beta$ (and $\phi$ if the model has a dispersion parameter) at which it is centered. In the 
following development, we use the notation $(\hat\beta, \hat\phi)$ for the point at which the approximation is 
centered and $\hat\eta = X\hat\beta$ for the corresponding vector of linear predictors. In the mode-finding 
stage of the computation, we iteratively alter the center of the normal approximation. Once 
we have approximately reached the mode, we use the normal approximation at that fixed 
value of $(\hat\beta, \hat\phi)$. 

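Determining the parameters of the normal approximation. We can write the likelihood 
as 


$$
\begin{aligned}
p(y_1, \ldots, y_n \mid \eta, \phi) &= \prod_{i=1}^n p(y_i \mid \eta_i, \phi) \\
&= \prod_{i=1}^n \exp(L(y_i \mid \eta_i, \phi)),
\end{aligned}
$$


where $L$ is the log-likelihood function for the individual observations. We approximate each 
factor in the above product by a normal density in $\eta_i$, thus approximating each $L(y_i \mid \eta_i, \phi)$ 
by a quadratic function in $\eta_i$: 


$$
L(y_i \mid \eta_i, \phi) \approx -\frac{1}{2\sigma_i^2}(z_i - \eta_i)^2 + \text{constant},
$$


where, in general, $z_i$, $\sigma_i^2$, and the constant depend on $y$, $\hat\eta_i = (X\hat\beta)_i$, and $\phi$. That is, the 
$i$th data point is approximately equivalent to an observation $z_i$, normally distributed with 
mean $\eta_i$ and variance $\sigma_i^2$. 


A standard way to determine $z_i$ and $\sigma_i^2$ for the approximation is to match the first- and 
second-order terms of the Taylor series of $L(y_i \mid \eta_i, \phi)$ centered about $\hat\eta_i = (X\hat\beta)_i$. Writing 
$dL/d\eta$ and $d^2L/d\eta^2$ as $L'$ and $L''$, respectively, the result is 


$$
\begin{aligned}
z_i &= \hat\eta_i - \frac{L'(y_i \mid \hat\eta_i, \phi)}{L''(y_i \mid \hat\eta_i, \phi)} \\
\sigma_i^2 &= -\frac{1}{L''(y_i \mid \hat\eta_i, \phi)}.
\end{aligned}
\tag{16.4}
$$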
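
The matching step behind (16.4) can be written out explicitly. The following is a short derivation sketch in the notation above, with the arguments of $L$, $L'$, and $L''$ abbreviated to $\hat\eta_i$.

```latex
\begin{aligned}
L(y_i \mid \eta_i, \phi) &\approx L(\hat\eta_i) + L'(\hat\eta_i)(\eta_i - \hat\eta_i)
   + \tfrac{1}{2} L''(\hat\eta_i)(\eta_i - \hat\eta_i)^2, \\
-\frac{1}{2\sigma_i^2}(z_i - \eta_i)^2 &= -\frac{1}{2\sigma_i^2}(z_i - \hat\eta_i)^2
   + \frac{z_i - \hat\eta_i}{\sigma_i^2}(\eta_i - \hat\eta_i)
   - \frac{1}{2\sigma_i^2}(\eta_i - \hat\eta_i)^2 .
\end{aligned}
% Matching second-order terms: L''(\hat\eta_i) = -1/\sigma_i^2,
%   so \sigma_i^2 = -1/L''(\hat\eta_i).
% Matching first-order terms: L'(\hat\eta_i) = (z_i - \hat\eta_i)/\sigma_i^2,
%   so z_i = \hat\eta_i + \sigma_i^2 L'(\hat\eta_i) = \hat\eta_i - L'(\hat\eta_i)/L''(\hat\eta_i).
```

The zeroth-order terms are absorbed into the constant, which does not affect inference for $\beta$.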


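Example. The binomial-logistic model 
In the binomial generalized linear model with logistic link, the log-likelihood for each 
observation is 


$$
\begin{aligned}
L(y_i \mid \eta_i) &= y_i \log\left(\frac{e^{\eta_i}}{1 + e^{\eta_i}}\right) + (n_i - y_i) \log\left(\frac{1}{1 + e^{\eta_i}}\right) \\
&= y_i \eta_i - n_i \log(1 + e^{\eta_i}).
\end{aligned}
$$


(There is no dispersion parameter $\phi$ in the binomial model.) The derivatives of the 
log-likelihood are 


$$
\begin{aligned}
\frac{dL}{d\eta_i} &= y_i - n_i \frac{e^{\eta_i}}{1 + e^{\eta_i}} \\
\frac{d^2L}{d\eta_i^2} &= -n_i \frac{e^{\eta_i}}{(1 + e^{\eta_i})^2}.
\end{aligned}
$$


Thus, the pseudodata $z_i$ and variance $\sigma_i^2$ of the normal approximation for the $i$th 
sampling unit are 


$$
\begin{aligned}
z_i &= \hat\eta_i + \frac{(1 + e^{\hat\eta_i})^2}{e^{\hat\eta_i}} \left( \frac{y_i}{n_i} - \frac{e^{\hat\eta_i}}{1 + e^{\hat\eta_i}} \right) \\
\sigma_i^2 &= \frac{1}{n_i} \frac{(1 + e^{\hat\eta_i})^2}{e^{\hat\eta_i}}.
\end{aligned}
$$


The approximation depends on $\hat\beta$ through the linear predictor $\hat\eta$. 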
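
As a numerical check of these formulas, the pseudodata can be computed directly from the two derivatives. This is a minimal sketch; the function name is hypothetical.

```python
import math

def binomial_logistic_pseudodata(y, n, eta_hat):
    """Return (z, sigma2) for y successes in n trials at linear predictor eta_hat."""
    p = 1.0 / (1.0 + math.exp(-eta_hat))   # e^eta / (1 + e^eta)
    d1 = y - n * p                         # first derivative L'
    d2 = -n * p * (1.0 - p)                # second derivative L''
    z = eta_hat - d1 / d2                  # pseudodatum
    sigma2 = -1.0 / d2                     # pseudovariance
    return z, sigma2

# One success in one trial, approximation centered at eta_hat = 0.
print(binomial_logistic_pseudodata(1, 1, 0.0))
```

At $\hat\eta_i = 0$ the fitted probability is 0.5, so a single success gives $z_i = 2$ and $\sigma_i^2 = 4$.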


The posterior mode can be found using iterative weighted linear regression: at each step, 
one computes the normal approximation to the likelihood based on the current guess of $(\beta, \phi)$ 
and finds the mode of the resulting approximate posterior distribution by weighted linear 
regression. (If any prior information is available on $\beta$, it should be included as additional 
rows of data and explanatory variables in the regression, as described in Section 14.8 for 
fixed prior information and Chapter 15 for hierarchical models.) Iterating this process is 
equivalent to solving the system of $k$ nonlinear equations, $dp(\beta \mid y)/d\beta = 0$, using Newton's 
method, and converges to the mode rapidly for standard generalized linear models. One 
possible difficulty is estimates of $\beta$ tending to infinity, which can occur, for example, in 
logistic regression if there are some combinations of explanatory variables for which p is 
nearly equal to zero or one. Substantive prior information tends to eliminate this problem. 

If a dispersion parameter, $\phi$, is present, one can update $\phi$ at each step of the iteration by 
maximizing its conditional posterior density (which is one-dimensional), given the current 
guess of $\beta$. Similarly, one can include in the iteration any hierarchical variance parameters 
that need to be estimated and update their values at each step. 
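
The iteration can be sketched for a logistic regression with an intercept and one predictor under a uniform prior. This is a minimal illustration, not the routine used in R or Glim. The data are the bioassay doses and death counts as we understand Table 3.1 to report them (the table is not reproduced in this section, so they should be checked against it), and the function names are hypothetical.

```python
import math

def wls_two_params(x, z, w):
    """Weighted least squares of z on (1, x); returns (a, b, inverse of X'WX)."""
    s0 = sum(w)
    s1 = sum(wi * xi for wi, xi in zip(w, x))
    s2 = sum(wi * xi * xi for wi, xi in zip(w, x))
    t0 = sum(wi * zi for wi, zi in zip(w, z))
    t1 = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
    det = s0 * s2 - s1 * s1
    a = (s2 * t0 - s1 * t1) / det
    b = (s0 * t1 - s1 * t0) / det
    cov = [[s2 / det, -s1 / det], [-s1 / det, s0 / det]]
    return a, b, cov

def iwls_logistic(x, y, n, n_iter=25, tol=1e-10):
    """Iterative weighted linear regression for logit Pr(success) = a + b * x."""
    a, b, cov = 0.0, 0.0, None
    for _ in range(n_iter):
        z, w = [], []
        for xi, yi, ni in zip(x, y, n):
            eta = a + b * xi
            p = 1.0 / (1.0 + math.exp(-eta))
            d2 = -ni * p * (1.0 - p)             # L''
            z.append(eta - (yi - ni * p) / d2)   # pseudodatum z_i
            w.append(-d2)                        # weight 1 / sigma_i^2
        a_new, b_new, cov = wls_two_params(x, z, w)
        converged = abs(a_new - a) + abs(b_new - b) < tol
        a, b = a_new, b_new
        if converged:
            break
    return a, b, cov

# Assumed bioassay data: log dose, number of animals, number of deaths.
dose = [-0.86, -0.30, -0.05, 0.73]
animals = [5, 5, 5, 5]
deaths = [0, 1, 3, 5]
a_hat, b_hat, cov = iwls_logistic(dose, deaths, animals)
print(a_hat, b_hat, cov[1][1] ** 0.5)
```

The matrix returned at convergence is the inverse of $X^T \mathrm{diag}(-L'') X$, so its diagonal gives the squared standard errors of the normal approximation.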


Approximate normal posterior distribution 


Once the mode $(\hat\beta, \hat\phi)$ has been reached, one can approximate the conditional posterior 
distribution of $\beta$, given $\hat\phi$, by the output of the most recent weighted linear regression 
computation; that is, 


$$
p(\beta \mid \hat\phi, y) \approx \mathrm{N}(\beta \mid \hat\beta, V_\beta),
$$


where $V_\beta$ in this case is $(X^T \mathrm{diag}(-L''(y_i, \hat\eta_i, \hat\phi)) X)^{-1}$. (In general, one need only compute 
the Cholesky factor of $V_\beta$, as described in Section 14.2.) If the sample size $n$ is large, and $\phi$ 
is not part of the model (as in the binomial and Poisson distributions), we may be content to 
stop here and summarize the posterior distribution by the normal approximation to $p(\beta \mid y)$. 

If a dispersion parameter, $\phi$, is present, one can approximate the marginal distribution 
of $\phi$ using the method of Section 13.5 applied at the conditional mode, $\hat\beta(\phi)$, 


$$
p_{\mathrm{approx}}(\phi \mid y) = \frac{p(\beta, \phi \mid y)}{p_{\mathrm{approx}}(\beta \mid \phi, y)} \propto p(\hat\beta(\phi), \phi \mid y)\, |V_\beta(\phi)|^{1/2},
$$


where $\hat\beta$ and $V_\beta$ in the last expression are the mode and variance matrix of the normal 
approximation conditional on $\phi$. 


More advanced computational methods 


The normal approximation to the posterior distribution can be a good starting point, but 
we can do much better using full Bayesian inference, variational Bayes, or expectation 
propagation. Variational Bayes is described in Section 13.7; to apply the method to generalized 
linear models, an additional twist is needed involving a series of normal approximations to 
the likelihood, similar but not identical to those described above. We demonstrate 
expectation propagation for logistic regression in an example in Section 13.8. 


16.3 Weakly informative priors for logistic regression 


Nonidentifiability is a common problem in logistic regression. In addition to the problem of 
collinearity, familiar from linear regression, discrete-data regression can also become unsta- 
ble from separation, which arises when a linear combination of the predictors is perfectly 
predictive of the outcome. Separation is surprisingly common in applied logistic regression, 
especially with binary predictors, and in practical work is often handled inappropriately. 
For example, a common ‘solution’ to separation is to remove predictors until the resulting 
model is identifiable, which typically results in removing the strongest predictors from the 
model. 

An alternative approach to obtaining stable logistic regression coefficients is to use 
Bayesian inference. Here we propose a class of weakly informative priors based on the 
t distribution, along with a default choice that relies on the assumption that, in most 
problems, effects will not be extremely large. 

We assume prior independence of the coefficients, with the understanding that the model 
could be reparameterized if there are places where prior correlation is appropriate. For 
each coefficient, we assume a Student-$t$ prior distribution with mean 0, degrees-of-freedom 
parameter $\nu$, and scale $s$, with $\nu$ and $s$ chosen to provide minimal prior information to 
constrain the coefficients to lie in a reasonable range. We are motivated to consider the $t$ 
family because flat-tailed distributions allow for robust inference (see Chapter 17). 

We can perform full Bayesian computation with this model using the usual MCMC 
methods. In addition we can use an approximate EM algorithm to construct a quick iterative 
calculation to get a point estimate of the regression coefficients and standard errors. From 
the standpoint of point estimation, the prior distribution has the purpose of stabilizing 
(regularizing) the estimates of otherwise unmodeled parameters. 


The problem of separation 


When a fitted model ‘separates’ (that is, perfectly predicts some subset of) discrete data, 
the maximum likelihood estimate can be undefined or unreasonable, and Bayesian inference 
under a uniform prior distribution can also fail. We illustrate with an example that arose in 
one of our routine analyses, where we were fitting logistic regressions to a series of datasets. 


Example. Predicting vote from sex, ethnicity, and income 

Table 16.1 shows the estimated coefficients in a model predicting probability of 
Republican vote for president given sex, ethnicity, and income (coded as −2, −1, 0, 1, 2), 
fit separately to pre-election polls for a series of elections. The estimates look fine 
except in 1964, where there is complete separation: of the 87 African Americans in 
the survey that year, none reported a preference for the Republican candidate. We fit 
the model in R, which actually yielded a finite estimate for the coefficient of black 
even in 1964, but that number and its standard error are essentially meaningless, 
being a function of how long the iterative fitting procedure goes before giving up. The 
maximum likelihood estimate for the coefficient of black in that year is $-\infty$. 


1960                                   1968 
             coef.est  coef.se                      coef.est  coef.se 
(Intercept)     -0.06     0.10         (Intercept)      0.40     0.11 
female           0.24     0.14         female          -0.03     0.15 
black           -1.03     0.36         black           -3.64     0.59 
income           0.03     0.06         income          -0.03     0.07 
n = 875, k = 4                         n = 850, k = 4 

1964                                   1972 
             coef.est  coef.se                      coef.est  coef.se 
(Intercept)     -0.58     0.10         (Intercept)      0.95     0.09 
female          -0.08     0.14         female          -0.25     0.12 
black          -16.83   420.40         black           -2.58     0.26 
income           0.19     0.06         income           0.08     0.05 
n = 1059, k = 4                        n = 1518, k = 4 


Table 16.1 Estimates and standard errors from logistic regressions (with uniform prior 
distributions) predicting Republican vote intention in pre-election polls, fit separately to survey data from 
four presidential elections from 1960 through 1972. The estimates are reasonable except in 1964, 
where there is complete separation (with none of the black respondents supporting the Republican 
candidate, Barry Goldwater). 


[Figure: profile likelihood (vertical axis) plotted against the coefficient for 'black' (horizontal axis, from −15 to 5).] 


Figure 16.1 Profile likelihood (in this case, essentially the posterior distribution given a uniform 
prior distribution) of the coefficient of black from the logistic regression of Republican vote in 1964 
(displayed in the lower left of Table 16.1), conditional on point estimates of the other coefficients 
in the model. The maximum occurs as $\beta \to -\infty$, indicating that the best fit to the data would occur 
at this unreasonable limit. 


Figure 16.1 displays the profile likelihood for the coefficient of black in the voting 
example, that is, the likelihood for this one parameter conditional on the other parameters 
being set to their maximum likelihood estimates conditional on this parameter. The 
maximum likelihood estimate (or, equivalently, the posterior mode under a uniform 
prior density) is at $-\infty$, which makes no sense in this application. As we shall see, 
a weakly informative prior distribution will take the estimate down to a reasonable 
value. 


Computation with a specified normal prior distribution 


Working in the context of the logistic regression model, 


$$
\Pr(y_i = 1) = \mathrm{logit}^{-1}(X_i\beta), \tag{16.5}
$$


we adapt the classical maximum likelihood algorithm to obtain approximate posterior 
inference for the coefficients $\beta$, in the form of an estimate $\hat\beta$ and covariance matrix $V_\beta$. 

The standard logistic regression algorithm, upon which we build, proceeds by 
approximately linearizing the derivative of the log-likelihood, solving using weighted least squares, 
and then iterating this process, each step evaluating the derivatives at the latest estimate 
$\hat\beta$. At each iteration, the algorithm determines pseudo-data $z_i$ and pseudo-variances $(\sigma_i^z)^2$ 
based on the linearization of the derivative of the log-likelihood, as shown on page 410. 

The simplest informative prior distribution assigns normal distributions for the 
components of $\beta$: 

$$
\beta_j \sim \mathrm{N}(\mu_j, \sigma_j^2), \quad \text{for } j = 1, \ldots, J. \tag{16.6}
$$


This information can be effortlessly included in the classical algorithm by simply altering 
the weighted least-squares step, augmenting the approximate likelihood with the prior 
distribution. If the model has $J$ coefficients $\beta_j$ with independent $\mathrm{N}(\mu_j, \sigma_j^2)$ prior distributions, 
then we add $J$ pseudo-data points and perform weighted linear regression on 'observations' 
$z_*$, 'explanatory variables' $X_*$, and weight vector $w_*$, where 


$$
z_* = \begin{pmatrix} z \\ \mu \end{pmatrix}, \quad
X_* = \begin{pmatrix} X \\ I_J \end{pmatrix}, \quad
w_* = (\sigma^z, \sigma)^{-2}. \tag{16.7}
$$


The vectors $z_*$, $w_*$ and the matrix $X_*$ are constructed by combining the likelihood ($z$ and 
$\sigma^z$ are the vectors of $z_i$'s and $\sigma_i^z$'s defined in (14.7), and $X$ is the design matrix of the 
regression (16.5)) and the prior ($\mu$ and $\sigma$ are the vectors of $\mu_j$'s and $\sigma_j$'s in (16.6), and $I_J$ 
is the $J \times J$ identity matrix). As a result, $z_*$ and $w_*$ are vectors of length $n + J$ and $X_*$ is 
an $(n + J) \times J$ matrix. With the augmented $X_*$, this regression is identified, and thus the 
resulting estimate $\hat\beta$ is well defined and has finite variance, even if the original data have 
collinearity or separation that would result in nonidentifiability of the maximum likelihood 
estimate. 

The full computation is then iteratively weighted least squares, starting with a guess 
of $\beta$, then computing the derivatives of the log-likelihood to compute $z$ and $\sigma^z$, then 
using weighted least squares on the pseudo-data (16.7) to yield an updated estimate of $\beta$, 
then recomputing the derivatives of the log-likelihood at this new value of $\beta$, and so forth, 
converging to the estimate $\hat\beta$. The covariance matrix $V_\beta$ is simply the inverse second 
derivative matrix of the log-posterior density evaluated at $\hat\beta$, that is, the usual normal-theory 
uncertainty estimate for an estimate not on the boundary of parameter space. 
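
The effect of the extra pseudo-data row can be sketched in the simplest separated case. This is a minimal illustration with a single coefficient ($J = 1$) and no intercept. The data, 87 trials with no successes, are a hypothetical stand-in that only echoes the 1964 example, and the function names are ours.

```python
import math

def iwls_step(beta, x, y, n, prior_mean=None, prior_sd=None):
    """One weighted least-squares update for a single logistic coefficient.

    If a normal prior is supplied, it enters as one extra pseudo-data row
    (z = prior_mean, x = 1, weight = 1 / prior_sd^2), as in (16.7).
    Returns the updated beta and the inverse of the total weight.
    """
    num, den = 0.0, 0.0
    for xi, yi, ni in zip(x, y, n):
        eta = xi * beta
        p = 1.0 / (1.0 + math.exp(-eta))
        w = ni * p * (1.0 - p)              # weight 1 / (sigma_i^z)^2 = -L''
        z = eta + (yi - ni * p) / w         # pseudo-datum eta - L' / L''
        num += w * xi * z
        den += w * xi * xi
    if prior_sd is not None:
        num += prior_mean / prior_sd ** 2
        den += 1.0 / prior_sd ** 2
    return num / den, 1.0 / den

def fit(x, y, n, n_iter, prior_mean=None, prior_sd=None):
    """Run n_iter updates starting from beta = 0."""
    beta, var = 0.0, None
    for _ in range(n_iter):
        beta, var = iwls_step(beta, x, y, n, prior_mean, prior_sd)
    return beta, var

x, y, n = [1.0], [0], [87]                  # hypothetical separated data

# Uniform prior: the estimate keeps drifting and depends on when we stop.
print(fit(x, y, n, 10)[0], fit(x, y, n, 20)[0])

# N(0, 2.5^2) prior: the iteration settles at a finite value.
print(fit(x, y, n, 50, prior_mean=0.0, prior_sd=2.5))
```

Under the uniform prior each step moves the estimate by roughly one unit toward $-\infty$, which is the behavior behind the meaningless finite estimate reported for 1964.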


Approximate EM algorithm with a t prior distribution 


If the coefficients $\beta_j$ have independent $t$ prior distributions with centers $\mu_j$ and scales $s_j$, 
we can adapt the just-described iteratively weighted least squares algorithm to estimate 
the coefficients using an approximate EM algorithm. We shall describe the steps of the 
algorithm shortly; the idea is to express the $t$ prior distribution for each coefficient $\beta_j$ as a 
mixture of normals with unknown scale $\sigma_j$, 


$$
\beta_j \sim \mathrm{N}(\mu_j, \sigma_j^2), \quad \sigma_j^2 \sim \text{Inv-}\chi^2(\nu_j, s_j^2), \tag{16.8}
$$


and then average over the $\beta_j$'s at each step, treating them as missing data and performing 
the EM algorithm to estimate the $\sigma_j$'s. The algorithm proceeds by alternating one step of 
iteratively weighted least squares (as described above) and one step of EM. Once enough 
iterations have been performed to reach approximate convergence, we get an estimate and 
covariance matrix for the vector parameter $\beta$ and the estimated $\sigma_j$'s. 

We initialize by setting each $\sigma_j$ to the value $s_j$ (the scale of the prior distribution) and, 
as before, starting with a guess of $\beta$ (either obtained earlier from a crude estimate or simply 
picking a starting value such as $\beta = 0$). Then, at each step of the algorithm, we update $\sigma$ 
by maximizing the expected value of its (approximate) log-posterior density, 


$$
\begin{aligned}
\log p(\beta, \sigma \mid y) \approx{} & -\frac{1}{2} \sum_{i=1}^n \frac{1}{(\sigma_i^z)^2} (z_i - X_i\beta)^2
 - \frac{1}{2} \sum_{j=1}^J \left( \frac{1}{\sigma_j^2} (\beta_j - \mu_j)^2 + \log(\sigma_j^2) \right) \\
& - p(\sigma_j \mid \nu_j, s_j) + \text{constant}.
\end{aligned}
\tag{16.9}
$$


Each iteration of the algorithm proceeds as follows: 


1. Based on the current estimate of $\beta$, perform the normal approximation to the 
log-likelihood and determine the vectors $z$ and $\sigma^z$ using (14.7), as in classical logistic 
regression computation. 


2. Approximate E-step: first run the weighted least squares regression based on the 
augmented data (16.7) to get an estimate $\hat\beta$ with variance matrix $V_\beta$. Then determine the 
expected value of the log-posterior density by replacing the terms $(\beta_j - \mu_j)^2$ in (16.9) by 


$$
\mathrm{E}\left( (\beta_j - \mu_j)^2 \mid \sigma, y \right) \approx (\hat\beta_j - \mu_j)^2 + (V_\beta)_{jj}, \tag{16.10}
$$


which is only approximate because we are averaging over a normal distribution that is 
only an approximation to the generalized linear model likelihood. 


3. M-step: maximize the (approximate) expected value of the log-posterior density (16.9) 
to get the estimate, 


$$
\hat\sigma_j^2 = \frac{(\hat\beta_j - \mu_j)^2 + (V_\beta)_{jj} + \nu_j s_j^2}{1 + \nu_j}, \tag{16.11}
$$


which corresponds to the (approximate) posterior mode of $\sigma_j^2$ given a single measurement 
with value (16.10) and an Inv-$\chi^2(\nu_j, s_j^2)$ prior distribution. 


4. Recompute the derivatives of the log-posterior density given the current $\hat\beta$, set up the 
augmented data (16.7) using the estimated $\hat\sigma$ from (16.11), and repeat steps 1, 2, 3 above. 


At convergence of the algorithm, we summarize the inferences using the latest estimate $\hat\beta$ 
and covariance matrix $V_\beta$. 
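
The alternation of weighted least squares and EM updates can be sketched for a single coefficient. This is a minimal illustration under stated assumptions ($J = 1$, no intercept, hypothetical separated data of 87 trials with no successes), not the general algorithm; the function name and defaults are ours.

```python
import math

def approx_em_t_prior(x, y, n, mu=0.0, s=2.5, nu=1.0, n_iter=200):
    """Approximate EM for one logistic coefficient with a t_nu(mu, s) prior.

    Returns the estimate of beta, its approximate variance, and the
    estimated sigma^2 of the normal mixture component.
    """
    beta, var = 0.0, None
    sigma2 = s ** 2                                # initialize sigma at the prior scale
    for _ in range(n_iter):
        # Prior pseudo-data row: z = mu, x = 1, weight = 1 / sigma^2.
        num, den = mu / sigma2, 1.0 / sigma2
        # Step 1: pseudo-data and weights from the current beta.
        for xi, yi, ni in zip(x, y, n):
            eta = xi * beta
            p = 1.0 / (1.0 + math.exp(-eta))
            w = ni * p * (1.0 - p)
            z = eta + (yi - ni * p) / w
            num += w * xi * z
            den += w * xi * xi
        # Step 2 (approximate E-step): augmented weighted least squares.
        beta, var = num / den, 1.0 / den
        expected_sq = (beta - mu) ** 2 + var       # replaces (beta - mu)^2
        # Step 3 (M-step): update sigma^2.
        sigma2 = (expected_sq + nu * s ** 2) / (1.0 + nu)
    return beta, var, sigma2

# Hypothetical separated data with a Cauchy (nu = 1) prior of scale 2.5.
print(approx_em_t_prior([1.0], [0], [87]))
```

In this sketch the estimate is finite but more extreme than under a normal prior with the same scale, since the heavier tails of the $t$ family pull less strongly on large coefficients.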


Default prior distribution for logistic regression coefficients 


As noted above, we use independent $t$ priors for the coefficients. Setting the scale $\sigma$ to 
infinity (the convenient flat prior) fails under separation. Instead we want to add some 
prior information that constrains the estimate to be away from unrealistic extremes. 

A challenge in setting up any default prior distribution is getting the scale right: for 
example, suppose we are predicting vote preference given age (in years). We would not want 
the same prior distribution if the age scale were shifted to months. But discrete predictors 
have their own natural scale (most notably, a change of 1 in a binary predictor) that we 
would like to respect. 

The first step of our model is to standardize the input variables: 


• Binary inputs are shifted to have a mean of 0 and to differ by 1 in their lower and upper 
conditions. (For example, if a population is 10% black and 90% other, we would define 
the centered c.black variable to take on the values 0.9 and −0.1.) 


• Other inputs are shifted to have a mean of 0 and scaled to have a standard deviation of 
0.5. This scaling puts continuous variables on the same scale as symmetric binary inputs 
(which, taking on the values ±0.5, have standard deviation 0.5). 


Figure 16.2 (solid line) Cauchy density with scale 2.5, (dashed line) $t_7$ density with scale 2.5, 
(dotted line) likelihood for $\theta$ corresponding to a single binomial trial of probability $\mathrm{logit}^{-1}(\theta)$ with 
one-half success and one-half failure. All these curves favor values below 5 in absolute value; we 
choose the Cauchy as our default model because it allows the occasional probability of larger values. 


We distinguish between regression inputs and predictors. For example, in a regression on 
age, sex, and their interaction, there are four predictors (the constant term, age, sex, and 
age x sex), but just two inputs: age and sex. It is the input variables, not the predictors, 
that are standardized. 
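
The standardization rule for inputs can be sketched as follows. This is a minimal illustration; the function name is hypothetical, an input is treated as binary when it takes exactly two distinct values, and the sample standard deviation uses the $n - 1$ divisor, which is our assumption.

```python
def standardize_input(values):
    """Center an input; rescale binary inputs to differ by 1, others to sd 0.5."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("a constant input cannot be standardized")
    mean = sum(values) / len(values)
    if len(distinct) == 2:
        low, high = distinct
        # Binary input: mean 0, lower and upper conditions differ by 1.
        return [(v - mean) / (high - low) for v in values]
    # Other inputs: mean 0 and standard deviation 0.5.
    sd = (sum((v - mean) ** 2 for v in values) / (len(values) - 1)) ** 0.5
    return [0.5 * (v - mean) / sd for v in values]

# A binary input that equals 1 for 10% of the cases.
print(sorted(set(standardize_input([1] + [0] * 9))))
```

Interactions would then be formed from the standardized inputs, since it is the inputs and not the predictors that are rescaled.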

A prior distribution on standardized variables depends on the data, but this is not 
necessarily a bad idea, given that the range of the data can provide information about the 
plausible scale of the coefficients. From a fully Bayesian perspective, we can think of our 
scaling procedure as an approximation to a hierarchical model in which the appropriate 
scaling parameter is estimated from the data. 

We center our default prior distributions at zero because, in the absence of any 
problem-specific information, we have no idea if the coefficients $\beta$ will be positive or negative. We 
now need to choose default values of the scale $s$ and degrees of freedom $\nu$ in the $t$ prior 
distributions on the coefficients of the rescaled inputs. 

One way to pick a default value of $\nu$ and $s$ is to consider the baseline case of one-half of a 
success and one-half of a failure for a single binomial trial with probability $p = \mathrm{logit}^{-1}(\theta)$, 
that is, a logistic regression with only a constant term. The corresponding likelihood is 
$e^{\theta/2}/(1 + e^{\theta})$, which is close to a $t$ density function with 7 degrees of freedom and scale 2.5. 
We shall choose a slightly more conservative choice, the Cauchy, or $t_1$, distribution, again 
with a scale of 2.5. Figure 16.2 shows the three density functions: they all give preference to 
values less than 5, with the Cauchy allowing the occasional possibility of very large values. 
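
The closeness of the three curves can be checked numerically. In this sketch the likelihood $e^{\theta/2}/(1 + e^{\theta})$ is divided by its integral over $\theta$, which equals $\pi$, so that it can be compared with the densities; the function names are hypothetical.

```python
import math

def half_half_likelihood(theta):
    """e^{theta/2} / (1 + e^theta), normalized by its integral (pi)."""
    return 1.0 / (math.pi * (math.exp(theta / 2.0) + math.exp(-theta / 2.0)))

def t_density(theta, nu, scale):
    """Density of a t distribution with nu degrees of freedom, center 0, given scale."""
    c = math.gamma((nu + 1.0) / 2.0) / (
        math.gamma(nu / 2.0) * math.sqrt(nu * math.pi) * scale)
    return c * (1.0 + (theta / scale) ** 2 / nu) ** (-(nu + 1.0) / 2.0)

# Compare the normalized likelihood, the t_7 density, and the Cauchy density.
for theta in (0.0, 2.5, 5.0, 10.0):
    print(theta,
          half_half_likelihood(theta),
          t_density(theta, 7.0, 2.5),
          t_density(theta, 1.0, 2.5))
```

The first two columns of output stay close to each other, while the Cauchy column is lower near zero and noticeably higher at $\theta = 10$.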

We assign independent Cauchy prior distributions with center 0 and scale 2.5 to each 
of the coefficients in the logistic regression except the constant term. When combined with 
the standardization, this implies that the absolute difference in logit probability should be 
less than 5 when moving from one standard deviation below the mean to one standard 
deviation above the mean in any input variable. Adding 5 on the logit scale is equivalent 
to shifting the predicted probability from 0.01 to 0.5, or from 0.5 to 0.99; these changes 
are larger than anything we would expect in most applications. (For example, the lifetime 
probability of death from lung cancer is about 1% for a nonsmoking man in the United 
States, as compared to a probability of less than 50% for a heavy smoker.) 

If we were to apply the Cauchy prior distribution with center 0 and scale 2.5 to the 
constant term as well, we would be stating that the success probability is probably between 
1% and 99% for units that are average in all the inputs. Depending on the context (for 
example, epidemiologic modeling of rare conditions), this might not make sense, so as a 
default we apply a weaker prior distribution, a Cauchy with center 0 and scale 10, which 
implies that we expect the success probability for an average case to be between $10^{-9}$ and 
$1 - 10^{-9}$. We typically have more information about the intercept than about any particular 
coefficient and so we can get by with a weaker prior. 


Figure 16.3 Estimates from maximum likelihood and Bayesian logistic regression with the 
recommended default prior distribution for the bioassay example (data in Table 3.1 on page 74). In 
addition to graphing the fitted curves (at right), we show raw computer output to illustrate how 
our approach would be used in routine practice. The big change in the estimated coefficient for z.x 
when going from glm to bayesglm may seem surprising at first, but upon reflection we prefer the 
second estimate with its lower coefficient for x, which is based on downweighting the most extreme 
possibilities that are allowed by the likelihood. 


Other models 


Linear regression. Our algorithm is basically the same for linear regression, except that 
weighted least squares is an exact rather than approximate maximum penalized likelihood, 
and also a step needs to be added to estimate the data variance. In addition, we would 
preprocess y by rescaling the outcome variable to have mean 0 and standard deviation 0.5 
before assigning the prior distribution (or, equivalently, multiply the prior scale parameter 
by the standard deviation of the data). Separation is not a concern in linear regression; how- 
ever, when applied routinely (for example, in iterative imputation algorithms), collinearity 
can arise, in which case it is helpful to have a proper but weak prior distribution. 


Other generalized linear models. Again, the basic algorithm is unchanged, except that the 
pseudo-data and pseudo-variances in (14.7), which are derived from the first and second 
derivatives of the log-likelihood, are changed (see Section 16.2). For Poisson regression 
and other models with the logarithmic link, we would not often expect effects larger than 
5 on the logarithmic scale, and so the prior distributions given in this section might be 
a reasonable default choice. In addition, for models such as the negative binomial that 
have dispersion parameters, these can be estimated using an additional step as is done 
when estimating the data-level variance in normal linear regression. For more complex 
models such as multinomial logit and probit, we have considered combining independent $t$ 
prior distributions on the coefficients with pseudo-data to identify cutpoints in the possible 
presence of sparse data. Such models also present computational challenges, as there is no 
simple existing iteratively weighted least squares algorithm for us to adapt. 


Bioassay example 


We next apply the default weakly informative prior distribution to the bioassay example 
from Section 3.7. Here there is no separation but the amount of information in the data is 
low, with only 20 binary data points. Table 3.1 on page 74 presents the data, from twenty 
animals exposed to four different doses of a toxin. Figure 16.3 shows the resulting logistic 
regression, as fit first using maximum likelihood and then using our default Cauchy prior 
distributions with center 0 and scale 10 (for the constant term) and 2.5 (for the coefficient 
of dose). Following our general procedure, we have assigned this prior distribution after 
rescaling x to have mean 0 and standard deviation 0.5. 

With such a small sample, the prior distribution actually makes a difference, lowering 
the estimated coefficient of standardized dose from 10.2 ± 6.4 to 5.4 ± 2.2. (On the rescaled 
parameterization in which the predictor has mean 0 and standard deviation 0.5, the estimate 
goes from 7.7 ± 4.9 to 4.4 ± 1.9.) Such a large change might seem disturbing, but for the 
reasons discussed above, we would doubt the effect to be as large as 10.2 on the logistic 
scale, and the analysis shows these data to be consistent with the much smaller effect size 
of 5.4. The large amount of shrinkage simply confirms how weak the information is that 
gave the original maximum likelihood estimate. The graph at the upper right of Figure 16.3 
shows the comparison in a different way: the maximum likelihood estimate fits the data 
almost perfectly; however, the discrepancies between the data and the Bayes fit are small, 
considering the sample size of only 5 animals within each group. 


Example. Predicting voting from ethnicity (continued) 

We apply our default prior distribution to the pre-election polls discussed earlier in 
this section, in which we could easily fit a logistic regression with flat priors for every 
election year except 1964, where the coefficient for black blew up due to complete 
separation in the data. 

The left column of Figure 16.4 shows the time series of estimated coefficients and error 
bounds for the four coefficients of the logistic regression fit separately to poll data for 
each election year from 1952 through 2000. All the estimates are reasonable except 
for the intercept and the coefficient for black in 1964, when the maximum likelihood 
estimates are infinite. As discussed already, we do not believe that these coefficient 
estimates for 1964 are reasonable: in the population as a whole, we do not believe 
that the probability was zero that an African American in the population would vote 
Republican. It is, however, completely predictable that with moderate sample sizes 
there will occasionally be separation, yielding infinite estimates. (As noted earlier, 
the estimates shown here for 1964 are finite only because the generalized linear model 
fitting routine in R stops after a finite number of iterations.) 

The other three columns of Figure 16.4 show the coefficient estimates using our default 
Cauchy prior distribution for the coefficients, along with the t7 and normal distribu- 
tions. In all cases, the prior distributions are centered at 0, with scale parameters set 
to 10 for the constant term and 2.5 for all other coefficients. All three prior distribu- 
tions do a reasonable job at stabilizing the estimated coefficient for race for 1964, while 
leaving the estimates for other years essentially unchanged. This example illustrates 
how a weakly informative prior distribution can work in routine practice. 
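
The stabilizing effect of the prior under complete separation can be seen in a minimal sketch. The toy data, the slope-only model, and the grid search below are assumptions made for illustration; they are not the election data or the fitting procedure used for Figure 16.4.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def log_likelihood(b, x, y):
    """Log-likelihood of a slope-only logistic regression, Pr(y=1) = logit^-1(b*x)."""
    total = 0.0
    for xi, yi in zip(x, y):
        sign = 1.0 if yi == 1 else -1.0
        total -= softplus(-sign * b * xi)
    return total

def log_cauchy_prior(b, scale=2.5):
    """Log density of a Cauchy(0, scale) prior, up to an additive constant."""
    return -math.log1p((b / scale) ** 2)

# Toy data with complete separation: every case with x < 0 has y = 0 and
# every case with x > 0 has y = 1. The predictor has mean 0 and sd 0.5.
x = [-0.5] * 10 + [0.5] * 10
y = [0] * 10 + [1] * 10

grid = [i * 0.01 for i in range(0, 2001)]  # candidate slopes from 0 to 20

ml_slope = max(grid, key=lambda b: log_likelihood(b, x, y))
map_slope = max(grid, key=lambda b: log_likelihood(b, x, y) + log_cauchy_prior(b))

print("flat prior, best slope on grid:", ml_slope)      # runs to the edge of the grid
print("Cauchy(0, 2.5) prior, posterior mode:", map_slope)
```

With a flat prior the likelihood keeps increasing in the slope, so the grid maximum sits at whatever upper limit is chosen; with the Cauchy prior the mode is finite and does not depend on that limit.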


Weakly informative default prior compared to actual prior information 


The Cauchy(0,2.5) prior distribution does not represent our prior knowledge about the 
coefficient for black in the logistic regression for 1964 or any other year. Even before 
seeing data from this particular series of surveys, we knew that blacks have been less likely 
than whites to vote for Republican candidates; thus our prior belief was that the coefficient 
was negative. Furthermore, our prior for any given year would be informed by data from 
other years. For example, given the series of estimates in Figure 16.4, we would guess 


1 For example, the second data point (log(x) = −0.30) has an empirical rate of 1/5 = 0.20 and a 
predicted probability (from the Bayes fit) of 0.27. With a sample size of 5, we could expect a standard error 
of $\sqrt{0.27 \cdot (1 - 0.27)/5} = 0.20$, so a difference of 0.07 should be of no concern. 
Figure 16.4 The left column shows the estimated coefficients (±1 standard error) for a logistic 
regression predicting probability of Republican vote for president given sex, race, and income, as 
fit separately to data from the National Election Study for each election 1952 through 2000. (The 
binary inputs female and black have been centered to have means of zero, and the numerical 
variable income has been centered and then rescaled by dividing by two standard deviations.) 

The complete separation in 1964 led to a coefficient estimate of −∞ that year. (The particular 
finite values of the estimate and standard error are determined by the number of iterations used by 
the glm function in R before stopping.) 

The other columns show estimates for the same model fit each year using independent Cauchy, t7, 
and normal prior distributions, each with center 0 and scale 2.5. All three prior distributions do 
a reasonable job at stabilizing the estimates for 1964, while leaving the estimates for other years 
essentially unchanged. 


that the coefficient for black in 2004 is probably between −2 and −5. Finally, we are not 
using specific prior knowledge about these elections. The Republican candidate in 1964 
was particularly conservative on racial issues and opposed the Civil Rights Act; on those 
grounds we would expect him to poll particularly poorly among African Americans (as 
indeed he did). In sum, we feel comfortable using a default model that excludes much 
potentially useful information, recognizing that we could add such information if it were 
judged to be worth the trouble (for example, instead of performing separate estimates for 
each election year, setting up a hierarchical model allowing the coefficients to gradually 
vary over time and including election-level predictors including information such as the 
candidates’ positions on economic and racial issues). 
16.4 Overdispersed Poisson regression for police stops 


There have been complaints in New York City and elsewhere that the police harass members 
of ethnic minority groups. In 1999, the New York State Attorney General’s Office instigated 
a study of the New York City police department’s ‘stop and frisk’ policy: the lawful practice 
of ‘temporarily detaining, questioning, and, at times, searching civilians on the street.’ The 
police have a policy of keeping records on stops, and we obtained all these records (about 
175,000 in total) for a fifteen-month period in 1998-1999. We analyzed these data to see 
to what extent different ethnic groups were stopped by the police. We focused on blacks 
(African Americans), hispanics (Latinos), and whites (European Americans). The ethnic 
categories were as recorded by the police making the stops. We excluded members of 
other ethnic groups (about 4% of the stops) because of the likelihood of ambiguities in 
classifications. (With such a low frequency of ‘other,’ even a small rate of misclassifications 
could cause large distortions in the estimates for that group. For example, if only 4% of 
blacks, hispanics, and whites were mistakenly labeled as ‘other,’ then this would nearly 
double the estimates for the ‘other’ category while having little effect on the three major 
groups.) 


Aggregate data 


Blacks and hispanics represented 51% and 33% of the stops, respectively, despite comprising 
only 26% and 24%, respectively, of the population of the city. Perhaps a more relevant 
comparison, however, is to the number of crimes committed by members of each ethnic 
group. 

Data on actual crimes are not available, so as a proxy we used the number of arrests 
within New York City in 1997 as recorded by the Division of Criminal Justice Services 
(DCJS) of New York State. These were deemed to be the best available measure of local 
crime rates categorized by ethnicity. We used these numbers to represent the frequency of 
crimes that the police might suspect were committed by members of each group. When 
compared in that way, the ratio of stops to DCJS arrests was 1.24 for whites, 1.54 for blacks, 
and 1.72 for hispanics: based on this comparison, blacks are stopped 23% and hispanics 
39% more often than whites. 


Regression analysis to control for precincts 


The analysis so far looks at average rates for the whole city. Suppose the police make 
more stops in high-crime areas but treat the different ethnic groups equally within any 
locality. Then the citywide ratios could show strong differences between ethnic groups even 
if stops are entirely determined by location rather than ethnicity. In order to separate these 
two kinds of predictors, we performed hierarchical analyses using the city’s 75 precincts. 
Because it is possible that the patterns are systematically different in neighborhoods with 
different ethnic compositions, we divided the precincts into three categories in terms of their 
black population: precincts that were less than 10% black, 10%-40% black, and over 40% 
black. Each of these represented roughly 1/3 of the precincts in the city, and we performed 
separate analyses for each set. 

Figure 16.5 Estimated rates $\exp(\alpha_e)$ at which people of different ethnic groups were stopped for 
different categories of crime, as estimated from hierarchical regressions (16.12) using previous year’s 
arrests as a baseline and controlling for differences between precincts. Separate analyses were done 
for the precincts that had less than 10%, 10%-40%, and more than 40% black population. For 
the most common stops (violent crimes and weapons offenses), blacks and hispanics were stopped 
about twice as often as whites. Rates are plotted on a logarithmic scale. 


For each ethnic group e = 1,2,3 and precinct p, we modeled the number of stops $y_{ep}$ 
using an overdispersed Poisson regression with indicators for ethnic groups, a hierarchical 
model for precincts, and using $n_{ep}$, the number of DCJS arrests for that ethnic group in 
that precinct (multiplied by 15/12 to scale to a fifteen-month period), as an offset: 


$$
\begin{aligned}
y_{ep} &\sim \mathrm{Poisson}\left(n_{ep}\, e^{\alpha_e + \beta_p + \epsilon_{ep}}\right) \\
\beta_p &\sim \mathrm{N}(0, \sigma_\beta^2) \\
\epsilon_{ep} &\sim \mathrm{N}(0, \sigma_\epsilon^2),
\end{aligned}
\tag{16.12}
$$


where the coefficients $\alpha_e$ control for ethnic groups, the $\beta_p$’s adjust for variation among 
precincts, and the $\epsilon_{ep}$’s allow for overdispersion. Of most interest are the exponentiated 
coefficients $\exp(\alpha_e)$, which represent relative rates of stops compared to arrests, after 
controlling for precinct. 
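
To see what the overdispersion term in (16.12) does, the following sketch simulates counts from the model with and without the $\epsilon_{ep}$ term and compares a simple Pearson dispersion measure. All parameter values and arrest counts are hypothetical, and the Poisson sampler is a basic stdlib implementation chosen to keep the example self-contained.

```python
import math
import random

def draw_poisson(rate, rng):
    """Draw from Poisson(rate) by counting exponential arrivals in the unit interval."""
    count = 0
    t = rng.expovariate(rate)
    while t < 1.0:
        count += 1
        t += rng.expovariate(rate)
    return count

def simulate_stops(alpha, n_precincts, sigma_beta, sigma_eps, rng):
    """Simulate pairs (y_ep, n_ep * exp(alpha_e + beta_p)) under model (16.12)."""
    records = []
    for _ in range(n_precincts):
        beta_p = rng.gauss(0.0, sigma_beta)          # precinct effect
        for a in alpha:
            n_ep = rng.randint(50, 500)              # hypothetical arrest counts
            eps = rng.gauss(0.0, sigma_eps)          # overdispersion term
            baseline = n_ep * math.exp(a + beta_p)
            y = draw_poisson(baseline * math.exp(eps), rng)
            records.append((y, baseline))
    return records

def pearson_dispersion(records):
    """Average of (y - mu)^2 / mu; close to 1 for pure Poisson data with mean mu."""
    return sum((y - mu) ** 2 / mu for y, mu in records) / len(records)

rng = random.Random(1)
alpha = [0.0, 0.4, 0.3]   # hypothetical log rates for three ethnic groups
pure = simulate_stops(alpha, 75, sigma_beta=0.5, sigma_eps=0.0, rng=rng)
over = simulate_stops(alpha, 75, sigma_beta=0.5, sigma_eps=0.3, rng=rng)
d_pure = pearson_dispersion(pure)
d_over = pearson_dispersion(over)
print(d_pure, d_over)
```

Without the $\epsilon_{ep}$ term the dispersion measure stays near 1; with it, the counts vary far more than a Poisson distribution with the same baseline mean would allow.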

By comparing to arrest rates, we can also separately analyze stops associated with 
different sorts of crimes. We did a separate comparison for each of four types of offenses: 
violent crimes, weapons offenses, property crimes, and drug crimes. For each, we modeled 
the number of stops $y_{ep}$ by ethnic group e and precinct p for that crime type, using as a 
baseline the previous year’s DCJS arrest rates $n_{ep}$ for that crime type. 

We thus estimated model (16.12) for twelve separate subsets of the data, corresponding 
to the four crime types and the three categories of precincts. We performed the computa- 
tions using Bugs (a predecessor to the program Stan described in Appendix C). Figure 16.5 
displays the estimated rates $\exp(\alpha_e)$. For each type of crime, the relative frequencies of 
stops for the different ethnic groups are in the same order for each set of precincts. We also 
performed an analysis including the month of arrest. Rates of stops were roughly constant 
over the 15-month period, and including month did not add anything informative to the 
comparison of ethnic groups. 

Figure 16.5 shows that, for the most frequent categories of stops—those associated with 
violent crimes and weapons offenses—blacks and hispanics were much more likely to be 
stopped than whites, in all categories of precincts. For violent crimes, blacks and hispanics 
were stopped 2.5 times and 1.9 times as often as whites, respectively, and for weapons 
crimes, blacks and hispanics were stopped 1.8 times and 1.6 times as often as whites. In the 
less common categories of stop, whites are slightly more often stopped for property crimes 
and more often stopped for drug crimes, in proportion to their previous year’s arrests in 
any given precinct. 

Does the overall pattern of disproportionate stops of minorities imply that the NYPD 
was acting in an unfair or racist manner? Not at all. It is reasonable to suppose that effective 
policing requires many people to be stopped and questioned in order to gather information 
about any given crime. In the context of some difficult relations between the police and 
ethnic minority communities in New York City, it is useful to have some quantitative sense 
of the issues under dispute. Given that there have been complaints about the frequency 
with which the police have been stopping blacks and hispanics, it is relevant to know that 
this is indeed a statistical pattern. The police department then has the opportunity to 
explain its policies to the affected communities. 


16.5 State-level opinions from national polls 


We illustrate the application of the analysis of variance (see Section 15.6) to hierarchical 
generalized linear models with a model of public opinion. 

Dozens of national opinion polls are conducted by media organizations before every 
election, and it is desirable to estimate opinions at the levels of individual states as well 
as for the entire country. These polls are generally based on national random-digit dialing 
with corrections for nonresponse based on demographic factors such as sex, ethnicity, age, 
and education. We estimated state-level opinions from these polls, while simultaneously 
correcting for nonresponse, in two steps. For any survey response of interest: 


1. We fit a regression model for the individual response y given demographics and state. 
This model thus estimates an average response $\theta_j$ for each cross-classification j of 
demographics and state. In our example, we have sex (male or female), ethnicity (African 
American or other), age (4 categories), education (4 categories), and 50 states; thus 3200 
categories. 


2. From the U.S. Census, we get the adult population $N_j$ for each category j. The estimated 
population average of the response y in any state s is then 
$\theta_s = \sum_{j \in s} N_j \theta_j / \sum_{j \in s} N_j$, 
with each summation over the 64 demographic categories in the state. 
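
The second step is a population-weighted average, which can be sketched as follows. The cell estimates and census counts are hypothetical, and the function name is ours.

```python
def poststratify(theta, population):
    """Population-weighted average of cell estimates: sum(N_j * theta_j) / sum(N_j)."""
    if len(theta) != len(population):
        raise ValueError("theta and population must have the same length")
    total = sum(population)
    if total <= 0:
        raise ValueError("population counts must sum to a positive number")
    return sum(n * t for n, t in zip(population, theta)) / total

# Hypothetical state with four demographic cells (the example in the text has 64).
theta_cells = [0.60, 0.45, 0.10, 0.55]      # estimated cell-level averages
census_counts = [4000, 3000, 1000, 2000]    # adult population in each cell
state_estimate = poststratify(theta_cells, census_counts)
print(state_estimate)
```

Cells with few or no survey respondents still enter the sum through their census counts, which is why the cell-level estimates need to come from a model rather than from raw cell means.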

We need a large number of categories because (a) we are interested in separating out the 
responses by state, and (b) nonresponse adjustments force us to include the demographics. 
As a result, any given survey will have few or no data in many categories. This is not 
a problem, however, if a multilevel model is fitted. Each factor or set of interactions in 
the model corresponds to a row in the Anova plot and is automatically given a variance 
component. 

As discussed in the survey sampling literature, this inferential procedure works well 
and outperforms standard survey estimates when estimating state-level outcomes. For 
this example, we choose a single outcome—the probability that a respondent prefers the 
Republican candidate for President—as estimated by a logistic regression model from a set 
of seven CBS News polls conducted during the week before the 1988 presidential election. 
We focus here on the first stage of the estimation procedure—the inference for the logistic 
regression model—and use our Anova tools to display the relative importance of each factor 
in the model. 

We label the survey responses $y_i$ as 1 for supporters of the Republican candidate and 0 
for supporters of the Democrat (with undecideds excluded) and model them as independent, 
with $\Pr(y_i = 1) = \mathrm{logit}^{-1}(X_i\beta)$. The design matrix X is all 0’s and 1’s with indicators 
for the demographic variables used by CBS in the survey weighting: sex, ethnicity, age, 
education, and the interactions of sex × ethnicity and age × education. We also include 
in X indicators for the 50 states and for the 4 regions of the country (northeast, midwest, 
south, and west). Since the states are nested within regions, no main effects for states 
are needed. As in our general approach for linear models, we give each batch of regression 
coefficients an independent normal distribution centered at zero and with standard deviation 
estimated hierarchically given a uniform prior density. 


Figure 16.6 Anova display for two logistic regression models of the probability that a survey 
respondent prefers the Republican candidate for the 1988 U.S. presidential election, based on data from 
seven CBS News polls. Point estimates and error bars show posterior medians, 50% intervals, 
and 95% intervals of the finite-population standard deviations $s_m$. The demographic factors are 
those used by CBS to perform their nonresponse adjustments, and states and regions are included 
because we were interested in estimating average opinions by state. The large effects for ethnicity, 
region, and state suggest that it might make sense to include interactions, hence the inclusion of 
the ethnicity × region and ethnicity × state effects in the second model. 

We fitted the model and used posterior simulations to compute finite-sample standard 
deviations and plot the results. The left plot of Figure 16.6 displays the analysis of variance, 
which shows that ethnicity is by far the most important demographic factor, with state also 
explaining much of the variation. 

The natural next step is to consider interactions among the most important effects, as 
shown in the plot on the right side of Figure 16.6. The ethnicity × state × region 
interactions are surprisingly large: the differences between African Americans and others vary 
dramatically by state. As with the example in Section 15.6 of Internet connect times, the 
analysis of variance is a helpful tool in understanding the importance of different compo- 
nents in a hierarchical model. 


16.6 Models for multivariate and multinomial responses 
Multivariate outcomes 


We reanalyze the meta-analysis example of Section 5.6 using a binomial data model with a 
bivariate normal distribution for the parameters describing the individual studies. 


Example. Meta-analysis with binomial outcomes 

In this example, the results of each of 22 clinical trials are summarized by a 2 x 2 
table of death and survival under each of two treatments, and we are interested in the 
distribution of the effects of the treatment on the probability of death. The analysis 
of Section 5.6 was based on a normal approximation to the empirical log-odds ratio 
in each study. Because of the large sample sizes, the normal approximation is fairly 
accurate in this case, but it is desirable to have a more exact procedure for the general 
problem. 
In addition, the univariate analysis in Section 5.6 used the ratio, but not the average, 
of the death rates in each trial; ignoring this information can have an effect, even with 
large samples, if the average death rates are correlated with the treatment effects. 


Data model. Continuing the notation of Section 5.6, let $y_{ij}$ be the number of deaths 
out of $n_{ij}$ patients for treatment i = 0,1 and study j = 1,...,22. As in the earlier 
discussion we take i = 1 to represent the treated groups, so that negative values of the 
log-odds ratio represent reduced frequency of death under the treatment. Our data 
model is binomial: 

$$ y_{ij} \mid n_{ij}, p_{ij} \sim \mathrm{Bin}(n_{ij}, p_{ij}), $$

where $p_{ij}$ is the probability of death under treatment i in study j. We must now 
model the 44 parameters $p_{ij}$, which naturally follow a multivariate model, since they 
fall into 22 groups of two. We first transform the $p_{ij}$’s to the logit scale, so they are 
defined on the range $(-\infty, \infty)$ and can plausibly be fitted by the normal distribution. 


Hierarchical model in terms of transformed parameters. Rather than fitting a normal 
model directly to the parameters $\mathrm{logit}(p_{ij})$, we transform to the average and difference 
effects for each experiment: 


$$
\begin{aligned}
\beta_{1j} &= (\mathrm{logit}(p_{0j}) + \mathrm{logit}(p_{1j}))/2 \\
\beta_{2j} &= \mathrm{logit}(p_{1j}) - \mathrm{logit}(p_{0j}).
\end{aligned}
\tag{16.13}
$$
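
As a minimal sketch, transformation (16.13) and its inverse can be written as below. The death rates used are hypothetical and are not taken from the beta-blocker trials; the function names are ours.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def to_beta(p0, p1):
    """Map death probabilities (control p0, treated p1) to (beta1, beta2) as in (16.13)."""
    beta1 = (logit(p0) + logit(p1)) / 2.0   # average logit
    beta2 = logit(p1) - logit(p0)           # log-odds ratio (treatment effect)
    return beta1, beta2

def from_beta(beta1, beta2):
    """Invert (16.13): recover (p0, p1) from the average and the difference."""
    return inv_logit(beta1 - beta2 / 2.0), inv_logit(beta1 + beta2 / 2.0)

beta1, beta2 = to_beta(0.10, 0.08)   # hypothetical death rates for one study
p0, p1 = from_beta(beta1, beta2)
print(beta1, beta2, p0, p1)
```

Because the map is linear on the logit scale, it is one-to-one, so nothing is lost by modeling the average and difference in place of the two logits.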


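The parameters $\beta_{2j}$ correspond to the $\theta_j$’s of Section 5.6. We model the 22 
exchangeable pairs $(\beta_{1j}, \beta_{2j})$ as following a bivariate normal distribution with unknown 
parameters: 

$$
p\left( \begin{pmatrix} \beta_{1j} \\ \beta_{2j} \end{pmatrix} \,\middle|\, \alpha, \Lambda \right)
= \mathrm{N}\left( \begin{pmatrix} \beta_{1j} \\ \beta_{2j} \end{pmatrix} \,\middle|\, \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix}, \Lambda \right).
$$

This is equivalent to a normal model on the parameter pairs $(\mathrm{logit}(p_{0j}), \mathrm{logit}(p_{1j}))$; 
however, the linear transformation should leave the $\beta$’s roughly independent in their 
population distribution, making our inference less sensitive to the prior distribution 
for their correlation. 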


Hyperprior distribution. We use the usual noninformative uniform prior distribution 
for the parameters $\alpha_1$ and $\alpha_2$. For the hierarchical variance matrix $\Lambda$, there is no 
standard noninformative choice; for this problem, we assign independent uniform prior 
distributions to the variances $\Lambda_{11}$ and $\Lambda_{22}$ and the correlation, $\rho_{12} = \Lambda_{12}/(\Lambda_{11}\Lambda_{22})^{1/2}$. 
The resulting posterior distribution is proper (see Exercise 16.9). 


Posterior computations. We drew samples from the posterior distribution in the usual 
way based on successive approximations, following the general strategy described in 
Part III. (This model could be fit easily using Stan, but we use this example to 
illustrate how such computations can be constructed directly.) The computational 
method we used here is almost certainly not the most efficient in terms of computer 
time, but it was relatively easy to program in a general way and yielded believable 
inferences. The model was parameterized in terms of $\beta$, $\alpha$, $\log(\Lambda_{11})$, $\log(\Lambda_{22})$, and 
Fisher’s z-transform of the correlation, $\frac{1}{2}\log\left(\frac{1+\rho_{12}}{1-\rho_{12}}\right)$, to transform the ranges of the 
parameters to the whole real line. We sampled random draws from an approximation 
based on conditional modes, followed by importance resampling, to obtain starting 
points for ten parallel runs of the Metropolis algorithm. We used a normal jumping 
kernel with covariance from the curvature of the posterior density at the mode, scaled 
by a factor of $2.4/\sqrt{49}$ (because the jumping is in 49-dimensional space; see page 
296). The simulations were run for 40,000 iterations, at which point the estimated 
scale reductions, $\hat{R}$, for all parameters were below 1.2 and most were below 1.1. We use 
                                               Posterior quantiles 
Estimand                                  2.5%     25%   median    75%   97.5% 
Study 1 avg logit, $\beta_{1,1}$         −3.16   −2.67   −2.42   −2.21   −1.79 
Study 1 effect, $\beta_{2,1}$            −0.61   −0.33   −0.23   −0.13    0.14 
Study 2 effect, $\beta_{2,2}$            −0.63   −0.37   −0.28   −0.19    0.06 
Study 3 effect, $\beta_{2,3}$            −0.58   −0.35   −0.26   −0.16    0.08 
Study 4 effect, $\beta_{2,4}$            −0.44   −0.30   −0.24   −0.17   −0.03 
Study 5 effect, $\beta_{2,5}$            −0.43   −0.27   −0.18   −0.08    0.16 
Study 6 effect, $\beta_{2,6}$            −0.68   −0.37   −0.27   −0.18    0.04 
Study 7 effect, $\beta_{2,7}$            −0.64   −0.47   −0.38   −0.31   −0.20 
Study 8 effect, $\beta_{2,8}$            −0.41   −0.27   −0.20   −0.11    0.10 
Study 9 effect, $\beta_{2,9}$            −0.61   −0.37   −0.29   −0.21   −0.01 
Study 10 effect, $\beta_{2,10}$          −0.49   −0.36   −0.29   −0.23   −0.12 
Study 11 effect, $\beta_{2,11}$          −0.50   −0.31   −0.24   −0.16    0.01 
Study 12 effect, $\beta_{2,12}$          −0.49   −0.32   −0.22   −0.11    0.13 
Study 13 effect, $\beta_{2,13}$          −0.70   −0.37   −0.24   −0.14    0.08 
Study 14 effect, $\beta_{2,14}$          −0.33   −0.18   −0.08    0.04    0.30 
Study 15 effect, $\beta_{2,15}$          −0.58   −0.38   −0.28   −0.18    0.05 
Study 16 effect, $\beta_{2,16}$          −0.52   −0.34   −0.25   −0.15    0.08 
Study 17 effect, $\beta_{2,17}$          −0.49   −0.29   −0.20   −0.10    0.17 
Study 18 effect, $\beta_{2,18}$          −0.54   −0.27   −0.17   −0.06    0.21 
Study 19 effect, $\beta_{2,19}$          −0.56   −0.30   −0.18   −0.05    0.25 
Study 20 effect, $\beta_{2,20}$          −0.57   −0.36   −0.26   −0.17    0.04 
Study 21 effect, $\beta_{2,21}$          −0.65   −0.41   −0.32   −0.24   −0.08 
Study 22 effect, $\beta_{2,22}$          −0.66   −0.39   −0.27   −0.18   −0.02 
mean of avg logits, $\alpha_1$           −2.59   −2.42   −2.34   −2.26   −2.09 
sd of avg logits, $\sqrt{\Lambda_{11}}$   0.39    0.48    0.55    0.63    0.83 
mean of effects, $\alpha_2$              −0.38   −0.29   −0.24   −0.20   −0.11 
sd of effects, $\sqrt{\Lambda_{22}}$      0.04    0.11    0.16    0.21    0.34 
correlation, $\rho_{12}$                 −0.61   −0.13    0.21    0.53    0.91 


Table 16.2 Summary of posterior inference for the bivariate analysis of the meta-analysis of the 
beta-blocker trials in Table 5.4. All effects are on the log-odds scale. Inferences are similar to the 
results of the univariate analysis of logit differences in Section 5.6: compare the individual study 
effects to Table 5.4 and the mean and standard deviation of average logits to Table 5.5. ‘Study 
1 avg logit’ is included above as a representative of the 22 parameters $\beta_{1j}$. (We would generally 
prefer to display all these inferences graphically but use tables here to give a more detailed view of 
the posterior inferences.) 


the resulting 200,000 simulations of $(\beta, \alpha, \Lambda)$ from the second halves of the simulated 
sequences to summarize the posterior distribution in Table 16.2. 
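
The reparameterization of the hierarchical variance matrix and the jump scaling used in these computations can be sketched as follows. This illustrates only the transformations described in the text, not the full Metropolis sampler; the numerical inputs are hypothetical and the function names are ours.

```python
import math

def to_unconstrained(var1, var2, rho):
    """Map (Lambda_11, Lambda_22, rho_12) to the whole real line."""
    z = 0.5 * math.log((1.0 + rho) / (1.0 - rho))   # Fisher's z-transform
    return math.log(var1), math.log(var2), z

def to_constrained(log_var1, log_var2, z):
    """Inverse map; tanh is the inverse of Fisher's z-transform."""
    return math.exp(log_var1), math.exp(log_var2), math.tanh(z)

def jump_scale(dimension):
    """Scale factor 2.4 / sqrt(d) quoted in the text for jumping in d dimensions."""
    return 2.4 / math.sqrt(dimension)

# 44 beta's + 2 alpha's + 3 free elements of Lambda = 49 parameters.
dimension = 2 * 22 + 2 + 3
u = to_unconstrained(0.30, 0.03, 0.2)    # hypothetical values for illustration
back = to_constrained(*u)
print(dimension, jump_scale(dimension), back)
```

Working on the unconstrained scale means a normal jumping kernel can never propose a negative variance or a correlation outside (−1, 1).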


Results from the posterior simulations. The posterior distribution for $\rho_{12}$ is centered 
near 0.21 with considerable variability. Consequently, the multivariate model would 
have only a small effect on the posterior inferences obtained from the univariate analy- 
sis concerning the log-odds ratios for the individual studies or the relevant hierarchical 
parameters. Comparing the results in Table 16.2 to those in Tables 5.4 and 5.5 shows 
that the inferences are similar. The multivariate analysis based on the exact posterior 
distribution fixes any deficiencies in the normal approximation required in the previ- 
ous analysis but does not markedly change the posterior inferences for the quantities 
of essential interest. 
Extension of the logistic link 


Appropriate models for polychotomous data can be developed as extensions of either the 
Poisson or binomial models. We first show how the logistic model for binary data can be 
extended to handle multinomial data. The notation for a multinomial random variable $y_i$ 
with sample size $n_i$ and k possible outcomes (that is, $y_i$ is a vector with $\sum_{j=1}^k y_{ij} = n_i$) is 
$y_i \sim \mathrm{Multin}(n_i; \alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{ik})$, with $\alpha_{ij}$ representing the probability of the jth category, 
and $\sum_{j=1}^k \alpha_{ij} = 1$. A standard way to parameterize the multinomial generalized linear 
model is in terms of the logarithm of the ratio of the probability of each category relative 
to that of a baseline category, which we label j = 1, so that 


$$ \log(\alpha_{ij}/\alpha_{i1}) = \eta_{ij} = X_i \beta^{(j)}, $$


where $\beta^{(j)}$ is a vector of parameters for the jth category. The data distribution is then 


$$ p(y \mid \beta) \propto \prod_{i=1}^n \prod_{j=1}^k \left( \frac{e^{\eta_{ij}}}{\sum_{l=1}^k e^{\eta_{il}}} \right)^{y_{ij}}, $$


with $\beta^{(1)}$ set equal to zero, and hence $\eta_{i1} = 0$ for each i. The vector $\beta^{(j)}$ indicates the effect 
of a change in X on the probability of observing outcomes in category j relative to category 
1. Often the linear predictor includes a set of indicator variables for the outcome categories 
indicating the relative frequencies when the explanatory variables take the default value 
X = 0; in that case we can write $\delta_j$ as the coefficient of the indicator for category j and 
$\eta_{ij} = \delta_j + X_i \beta^{(j)}$, with $\delta_1$ and $\beta^{(1)}$ typically set to 0. 
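
The baseline-category parameterization can be sketched in a few lines. The predictor values and coefficient vectors are hypothetical, and the function name is ours.

```python
import math

def category_probabilities(x, betas):
    """Probabilities alpha_i1..alpha_ik for one unit with predictor vector x.

    betas holds the coefficient vectors for categories 2..k; the baseline
    category 1 has its coefficients fixed at zero, so eta_i1 = 0.
    """
    etas = [0.0] + [sum(b * xv for b, xv in zip(beta, x)) for beta in betas]
    shift = max(etas)                          # subtract the max for numerical stability
    weights = [math.exp(e - shift) for e in etas]
    total = sum(weights)
    return [w / total for w in weights]

x_i = [1.0, 0.5]                       # hypothetical predictors (constant term, one input)
betas = [[0.2, -1.0], [-0.5, 0.8]]     # hypothetical beta^(2) and beta^(3)
probs = category_probabilities(x_i, betas)
print(probs, math.log(probs[1] / probs[0]))
```

The log ratio of any category's probability to the baseline probability recovers that category's linear predictor, which is the defining property of this parameterization.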


Special methods for ordered categories 


There is a distinction between multinomial outcomes with ordinal categories (for example, 
grades A, B, C, D) and those with nominal categories (for example, diseases). For ordinal 
categories the generalized linear model is often expressed in terms of cumulative probabilities 
($\pi_{ij} = \sum_{l \le j} \alpha_{il}$) rather than category probabilities, with 
$\log\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right) = \delta_j + X_i \beta^{(j)}$, where 
once again we typically take $\delta_1 = \beta^{(1)} = 0$. Due to the ordering of the categories, it may 
be reasonable to consider a model with a common set of regression parameters $\beta^{(j)} = \beta$ 
for each j. Another common choice for ordered categories is the multinomial probit model. 
Either of these models can also be expressed in latent variable form as described at the end 
of Section 16.1. 
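
A minimal sketch of the cumulative-logit model with a common coefficient vector follows. The cutpoints and the value of the linear predictor $X_i\beta$ are hypothetical; the sketch assumes increasing cutpoints so that the category probabilities are positive.

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def ordinal_probabilities(linear_predictor, cutpoints):
    """Category probabilities from cumulative logits logit(pi_ij) = delta_j + X_i beta.

    cutpoints are delta_1 < ... < delta_{k-1}; the cumulative probability of the
    last category is 1 by definition.
    """
    cumulative = [inv_logit(d + linear_predictor) for d in cutpoints] + [1.0]
    probs = [cumulative[0]]
    for j in range(1, len(cumulative)):
        probs.append(cumulative[j] - cumulative[j - 1])
    return probs

cutpoints = [0.0, 1.0, 2.5]   # hypothetical, with delta_1 = 0 as in the text
probs = ordinal_probabilities(linear_predictor=-0.4, cutpoints=cutpoints)
print(probs)
```

With a common $\beta$, changing the predictors shifts all cumulative logits by the same amount, so only the k − 1 cutpoints distinguish the categories.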


Using the Poisson model for multinomial responses 


Multinomial response data can also be analyzed using Poisson models by conditioning on 
appropriate totals. As this method is useful in performing computations, we describe it 
briefly and illustrate its use with an example. Suppose that $y = (y_1, \ldots, y_k)$ are independent 
Poisson random variables with means $\lambda = (\lambda_1, \ldots, \lambda_k)$. Then the conditional distribution 
of y, given $n = \sum_{j=1}^k y_j$, is multinomial: 


$$ p(y \mid n, \alpha) = \mathrm{Multin}(y \mid n; \alpha_1, \ldots, \alpha_k), \tag{16.14} $$


with $\alpha_j = \lambda_j / \sum_{i=1}^k \lambda_i$. This relation can also be used to allow data with multinomial 
response variables to be fitted using Poisson generalized linear models. The constraint on 
the sum of the multinomial probabilities is imposed by incorporating additional covariates 
in the Poisson regression whose coefficients are assigned uniform prior distributions. We 
illustrate with an example. 
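
Relation (16.14) can be checked numerically for a small case. The Poisson means and counts below are hypothetical; the check uses the standard fact that a sum of independent Poisson variables is Poisson with the summed mean.

```python
import math

def poisson_pmf(y, lam):
    return math.exp(-lam) * lam ** y / math.factorial(y)

def multinomial_pmf(y, probs):
    n = sum(y)
    coef = math.factorial(n)
    for yj in y:
        coef //= math.factorial(yj)     # exact: each partial quotient is an integer
    value = float(coef)
    for yj, pj in zip(y, probs):
        value *= pj ** yj
    return value

lam = [2.0, 0.5, 1.5]      # hypothetical Poisson means
y = [3, 0, 2]              # hypothetical counts
n = sum(y)

joint = 1.0
for yj, lj in zip(y, lam):
    joint *= poisson_pmf(yj, lj)
conditional = joint / poisson_pmf(n, sum(lam))   # the total n is Poisson(sum of lam)

alpha = [lj / sum(lam) for lj in lam]
print(conditional, multinomial_pmf(y, alpha))
```

The two printed values agree up to floating-point error, since the factor $e^{-\sum_j \lambda_j}$ cancels between the joint Poisson density and the density of the total.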
White                                  Black pieces 
pieces        Kar    Kas    Kor    Lju    Sei    Sho    Spa    Tal 
Karpov               1-0-1  1-0-0  0-1-1  1-0-0  0-0-2  0-0-0  0-0-0 
Kasparov      0-0-0         1-0-0  0-0-0  0-0-1  1-0-0  1-0-0  0-0-2 
Korchnoi      0-0-1  0-2-0         0-0-0  0-1-0  0-0-2  0-0-1  0-0-0 
Ljubojevic    0-1-0  0-1-1  0-0-2         0-1-0  0-0-1  0-0-2  0-0-1 
Seirawan      0-1-1  0-0-1  1-1-0  0-2-0         0-0-0  0-0-0  1-0-0 
Short         0-0-1  0-2-0  0-0-0  1-0-1  2-0-1         0-0-1  1-0-0 
Spassky       0-1-0  0-0-2  0-0-1  0-0-0  0-0-1  0-0-1         0-0-0 
Tal           0-0-2  0-0-0  0-0-3  0-0-0  0-0-1  0-0-0  0-0-1 


Table 16.3 Subset of the data from the 1988-1989 World Cup of chess: results of games between 
eight of the 29 players. Results are given as wins, losses, and draws; for example, when playing with 
the white pieces against Kasparov, Karpov had one win, no losses, and one draw. For simplicity, 
this table aggregates data from all six tournaments. 


Example. World Cup chess 

The 1988-1989 World Cup of chess consisted of six tournaments involving 29 of the 
world’s top chess players. Each tournament was a single round robin—each player 
played every other player exactly once—with 16 to 18 players. In total, 789 games 
were played; for each game in each tournament, the players, the outcome of the game 
(win, lose, or draw), and the identity of the player making the first move (thought to 
provide an advantage) are recorded. A subset of the data is displayed in Table 16.3. 


Multinomial model for paired comparisons with ties. A standard model for analyzing 
paired comparisons data, such as the results of a chess competition, is the Bradley-Terry 
model. In its most basic form, the model assumes that the probability that 
player i defeats player j is $p_{ij} = \exp(\alpha_i - \alpha_j)/(1 + \exp(\alpha_i - \alpha_j))$, where $\alpha_i, \alpha_j$ are 
parameters representing player abilities. The parameterization using $\alpha_i$ rather than 
$\gamma_i = \exp(\alpha_i)$ anticipates the generalized linear model approach. This basic model does 
not address the possibility of a draw nor the advantage of moving first; an extension 
of the model follows for the case when i moves first: 


$$
\begin{aligned}
p_{ij1} &= \Pr(i \text{ defeats } j \mid \theta) = \frac{e^{\alpha_i}}{e^{\alpha_i} + e^{\alpha_j + \gamma} + e^{\delta + \frac{1}{2}(\alpha_i + \alpha_j + \gamma)}} \\
p_{ij2} &= \Pr(j \text{ defeats } i \mid \theta) = \frac{e^{\alpha_j + \gamma}}{e^{\alpha_i} + e^{\alpha_j + \gamma} + e^{\delta + \frac{1}{2}(\alpha_i + \alpha_j + \gamma)}} \\
p_{ij3} &= \Pr(i \text{ draws with } j \mid \theta) = \frac{e^{\delta + \frac{1}{2}(\alpha_i + \alpha_j + \gamma)}}{e^{\alpha_i} + e^{\alpha_j + \gamma} + e^{\delta + \frac{1}{2}(\alpha_i + \alpha_j + \gamma)}},
\end{aligned}
\tag{16.15}
$$


where $\gamma$ determines the relative advantage or disadvantage of moving first, and $\delta$ 
determines the probability of a draw. 
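
The three outcome probabilities in (16.15) share a denominator, which makes them easy to compute together. The sketch below uses hypothetical parameter values; the function name is ours.

```python
import math

def outcome_probabilities(alpha_i, alpha_j, gamma, delta):
    """(win, loss, draw) probabilities for player i, who moves first, as in (16.15)."""
    win = math.exp(alpha_i)
    loss = math.exp(alpha_j + gamma)
    draw = math.exp(delta + 0.5 * (alpha_i + alpha_j + gamma))
    denom = win + loss + draw
    return win / denom, loss / denom, draw / denom

# Hypothetical parameter values, chosen only for illustration.
p_win, p_loss, p_draw = outcome_probabilities(alpha_i=0.3, alpha_j=0.1, gamma=-0.2, delta=0.5)
print(p_win, p_loss, p_draw)

# Adding the same constant to both abilities leaves the probabilities unchanged.
shifted = outcome_probabilities(alpha_i=1.3, alpha_j=1.1, gamma=-0.2, delta=0.5)
print(shifted)
```

The invariance to a common shift in the abilities shows that only differences in ability are identified by the data, so some constraint on the $\alpha_i$ is needed when fitting.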


Parameterization as a Poisson regression model with logarithmic link. Let $y_{ijk}$, k = 
1,2,3, be the number of wins, losses, and draws for player i in games with player 
j where i had the first move. We create a Poisson generalized linear model that is 
equivalent to the desired multinomial model. The $y_{ijk}$ are assumed to be independent 
Poisson random variables given the parameters. The mean of $y_{ijk}$ is $\mu_{ijk} = n_{ij} p_{ijk}$, 
where $n_{ij}$ is the number of games between players i and j where i had the first move. 
The Poisson generalized linear model equates the logarithms of the means of the $y_{ijk}$’s 
to a linear predictor. The logarithms of the means for the components of y are 


$$
\begin{aligned}
\log \mu_{ij1} &= \log n_{ij} + \alpha_i - A_{ij} \\
\log \mu_{ij2} &= \log n_{ij} + \alpha_j + \gamma - A_{ij} \\
\log \mu_{ij3} &= \log n_{ij} + \delta + \tfrac{1}{2}\alpha_i + \tfrac{1}{2}\alpha_j + \tfrac{1}{2}\gamma - A_{ij},
\end{aligned}
\tag{16.16}
$$


with $A_{ij}$ the logarithm of the denominator of the model probabilities in (16.15). The 
$A_{ij}$ terms allow us to impose the constraint on the sum of the three outcomes so 
that the three Poisson random variables describe the multinomial distribution; this is 
explained further below. 


Setting up the vectors of data and explanatory variables. To describe completely the 
generalized linear model that is suggested by the previous paragraph, we explain 
the various components in some detail. The outcome variable y is a vector of length 
3 × 29 × 28 containing the frequency of the three outcomes for each of the 29 × 28 ordered 
pairs (i, j). The mean vector is of the same length consisting of triples $(\mu_{ij1}, \mu_{ij2}, \mu_{ij3})$ 
as described above. The logarithmic link expresses the logarithm of the mean vector 
as the linear model $X\beta$. The parameter vector $\beta$ consists of the 29 player ability 
parameters ($\alpha_i$, i = 1,...,29), the first-move advantage parameter $\gamma$, the draw 
parameter $\delta$, and the 29 × 28 nuisance parameters $A_{ij}$ that were introduced to create the 
Poisson model. The columns of the model matrix X can be obtained by examining 
the expressions (16.16). For example, the first column of X, corresponding to $\alpha_1$, is 1 
in any row corresponding to a win for player 1 and 0.5 in any row corresponding to a 
draw for player 1. Similarly the column of the model matrix X corresponding to $\delta$ is 1 
for each row corresponding to a draw and 0 elsewhere. The final 29 × 28 submatrix of 
X corresponds to the parameters $A_{ij}$, which are not of direct interest. Each column is 
1 for the three rows that correspond to games between i and j for which i has the first 
move and is 0 elsewhere. When simulating from the posterior distribution, we do not 
sample the parameters $A_{ij}$; instead they are used to ensure that $y_{ij1} + y_{ij2} + y_{ij3} = n_{ij}$ 
as required by the multinomial distribution. 
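
The layout of the model matrix can be sketched for a small hypothetical tournament. The column ordering and the zero-based indexing are assumptions of this sketch, and the offset $\log n_{ij}$ is left out because its coefficient is known.

```python
def rows_for_pair(i, j, n_players, pair_index, n_pairs):
    """Three rows of X (win, loss, draw for player i moving first against j).

    Column layout (an assumption of this sketch): abilities alpha_1..alpha_K,
    then gamma, then delta, then one column per ordered pair for A_ij.
    Players and pairs are indexed from 0.
    """
    width = n_players + 2 + n_pairs
    gamma_col, delta_col = n_players, n_players + 1
    pair_col = n_players + 2 + pair_index

    win = [0.0] * width
    win[i] = 1.0

    loss = [0.0] * width          # a loss for i is a win for j
    loss[j] = 1.0
    loss[gamma_col] = 1.0

    draw = [0.0] * width
    draw[i] = 0.5
    draw[j] = 0.5
    draw[gamma_col] = 0.5
    draw[delta_col] = 1.0

    for row in (win, loss, draw):
        row[pair_col] = 1.0       # its coefficient plays the role of -A_ij in (16.16)
    return [win, loss, draw]

n_players = 3                     # small hypothetical example; the text has 29
pairs = [(i, j) for i in range(n_players) for j in range(n_players) if i != j]
X = []
for index, (i, j) in enumerate(pairs):
    X.extend(rows_for_pair(i, j, n_players, index, len(pairs)))
print(len(X), len(X[0]))
```

Each ordered pair contributes three rows and one nuisance column, so the matrix grows quickly with the number of players even though only the ability, first-move, and draw coefficients are of interest.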


Using an offset to make the Poisson model correspond to the multinomial. According 
to the model (16.16), log n;; should be included as a column of the model matrix X 
with known coefficient (equal to 1). A predictor with known coefficient is known as 
an offset in the terminology of generalized linear models (see page 407). Assuming a 
noninformative prior distribution for all the model parameters, this Poisson generalized 
linear model for the chess data is overparameterized, in the sense that the probabilities 
specified by the model are unchanged if a constant is added to each of the $\alpha_i$. It is 
common to require $\alpha_1 = 0$ to resolve this problem. Similarly, one of the $A_{ij}$ must be 
set to zero. A natural extension would be to treat the abilities as varying coefficients, 
in which case the restriction $\alpha_1 = 0$ is no longer required. 


16.7 Loglinear models for multivariate discrete data 


A standard approach to describe association among several categorical variables uses the 
family of loglinear models. In a loglinear model, the response or outcome variable is mul- 
tivariate and discrete: the contingency table of counts cross-classified according to several 
categorical variables. The counts are modeled as Poisson random variables, and the loga- 
rithms of the Poisson means are described by a linear model incorporating indicator variables 
for the various categorical levels. Alternatively, the counts in each of several margins of the 
table may be modeled as multinomial random variables if the total sample size or some 
marginal totals are fixed by design. Loglinear models can be fitted as a special case of 
the generalized linear model. Why then do we include a separate section concerned with 
loglinear models? Basically because loglinear models are commonly used in applications 
with multivariate discrete data analysis—especially for multiple imputation (see Chapter 
18)—and because there is an alternative computing strategy that is useful when interest 
focuses on the expected counts and a conjugate prior distribution is used. 
The Poisson or multinomial likelihood 


Consider a table of counts $y = (y_1,\ldots,y_n)$, where $i = 1,\ldots,n$ indexes the cells of the 
possibly multiway table. Let $\mu = (\mu_1,\ldots,\mu_n)$ be the vector of expected counts. The 
Poisson model for the counts $y$ has the distribution 

$$p(y|\mu) = \prod_{i=1}^{n} \frac{1}{y_i!}\, \mu_i^{y_i} e^{-\mu_i}.$$


If the total of the counts is fixed by the design of the study, then a multinomial distribution 
for y is appropriate, as in (16.14). If other features of the data are fixed by design—perhaps 
row or column sums in a two-way table—then the likelihood might be the product of several 
independent multinomial distributions. (For example, consider a stratified sample survey 
constrained to include exactly 500 respondents from each of four geographic regions.) In the 
remainder of this section we discuss the Poisson model, with additional discussion where 
necessary to describe the modifications required for alternative models. 
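
The relation between the Poisson and multinomial likelihoods can be checked numerically. The following is a minimal sketch in Python, not part of the original analysis; the function names and the example counts are illustrative assumptions. It uses the standard fact that, conditional on their total, independent Poisson counts with means $\mu_i$ follow a multinomial distribution with cell probabilities $\mu_i / \sum_i \mu_i$.

```python
import math

def poisson_logpmf(y, mu):
    """Log of the Poisson probability of count y with mean mu."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def multinomial_logpmf(y, p):
    """Log of the multinomial probability of counts y with cell probabilities p."""
    n = sum(y)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(yi + 1) for yi in y)
    return log_coef + sum(yi * math.log(pi) for yi, pi in zip(y, p))

# Illustrative (made-up) counts and expected counts for a table with four cells.
y = [3, 0, 5, 2]
mu = [2.5, 1.0, 4.0, 2.5]

total_mu = sum(mu)
joint_poisson = sum(poisson_logpmf(yi, mi) for yi, mi in zip(y, mu))
total_poisson = poisson_logpmf(sum(y), total_mu)

# log p(y | total) under the Poisson model ...
conditional = joint_poisson - total_poisson
# ... should agree with the multinomial log probability with p_i = mu_i / sum(mu).
multinomial = multinomial_logpmf(y, [mi / total_mu for mi in mu])
```

The two quantities `conditional` and `multinomial` should agree up to floating-point error, which is why fixing the total by design changes the likelihood only through the constraint on the sum of the expected counts.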


Setting up the matrix of explanatory variables 


The loglinear model constrains the all-positive expected cell counts $\mu$ to fall on a regression 
surface, $\log\mu = X\beta$. The incidence matrix $X$ is assumed known, and its elements are all 
zeros and ones; that is, all the variables in $X$ are indicator variables. We assume that there 
are no ‘structural’ zeros, that is, cells $i$ for which the expected count $\mu_i$ is zero by definition, and 
thus $\log\mu_i = -\infty$. (An example of a structural zero would be the category of ‘women with 
prostate cancer’ in a two-way classification of persons by sex and medical condition.) 

The choice of indicator variables to include depends on the important relations among 
the categorical variables. As usual, when assigning a noninformative prior distribution 
for the effect of a categorical variable with $k$ categories, one should include only $k - 1$ 
indicator variables. Interactions of two or more effects, represented by products of main 
effect columns, are used to model lack of independence. Typically a range of models is 
possible. 

The saturated model includes all variables and interactions; with the noninformative 
prior distribution, the saturated model has as many parameters as cells in the table. The 
saturated model has more practical use when combined with an informative or hierarchical 
prior distribution, in which case there are actually more parameters than cells because all 
k categories will be included for each factor. At the other extreme, the null model assigns 
equal probabilities to each cell, which is equivalent to fitting only a constant term in the 
hierarchical model. A commonly used simple model is independence, in which parameters 
are fitted to all one-way categories but no two-way or higher categories. With three categorical 
variables $z_1, z_2, z_3$, the joint independence model has no interactions; the saturated 
model has all main (one-way) effects, two-way interactions, and the three-way interaction; 
and models in between are used to describe different degrees of association among variables. 
For example, in the loglinear model that includes $z_1 z_3$ and $z_2 z_3$ interactions but no others, 
$z_1$ and $z_2$ are conditionally independent given $z_3$. 
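
As an illustration of how the incidence matrix encodes such a model, here is a small sketch in Python for three binary variables. The column ordering, the helper names, and the coefficient values are assumptions made only for this example. The matrix has a constant term, one indicator per binary variable (that is, $k - 1$ indicators for $k = 2$ categories), and the two interactions named above; the check at the end confirms that the resulting expected counts satisfy conditional independence of the first two variables given the third.

```python
import itertools
import math

# All 2 x 2 x 2 cells, each written as (z1, z2, z3) with 0/1 levels.
cells = list(itertools.product([0, 1], repeat=3))

def design_row(z1, z2, z3):
    """Row of X: constant, main effects, and the z1*z3 and z2*z3 interactions."""
    return [1, z1, z2, z3, z1 * z3, z2 * z3]

X = [design_row(*c) for c in cells]

def expected_counts(beta):
    """Expected counts mu = exp(X beta), keyed by cell."""
    return {c: math.exp(sum(x * b for x, b in zip(design_row(*c), beta)))
            for c in cells}

beta = [1.0, 0.3, -0.5, 0.8, 0.6, -0.4]  # arbitrary illustrative coefficients
mu = expected_counts(beta)

def cond_indep_gap(mu):
    """Largest deviation of p(z1, z2 | z3) from p(z1 | z3) * p(z2 | z3)."""
    gap = 0.0
    for z3 in (0, 1):
        total = sum(mu[(a, b, z3)] for a in (0, 1) for b in (0, 1))
        for z1 in (0, 1):
            for z2 in (0, 1):
                m1 = sum(mu[(z1, b, z3)] for b in (0, 1))
                m2 = sum(mu[(a, z2, z3)] for a in (0, 1))
                joint = mu[(z1, z2, z3)] / total
                gap = max(gap, abs(joint - (m1 / total) * (m2 / total)))
    return gap

gap = cond_indep_gap(mu)  # zero up to rounding error for any choice of beta
```

Adding a $z_1 z_2$ column to `design_row` would in general break this factorization, which is the sense in which interaction columns model lack of independence.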


Prior distributions 


Conjugate Dirichlet family. The conjugate prior density for the expected counts $\mu$ resembles 
the Dirichlet density: 

$$p(\mu) \propto \prod_{i=1}^{n} \mu_i^{k_i - 1}, \qquad (16.17)$$
with the constraint that the cell counts fit the loglinear model; that is, $p(\mu) = 0$ unless 
some $\beta$ exists for which $\log\mu = X\beta$. For a Poisson sampling model, the usual Dirichlet 
constraint, $\sum_{i=1}^{n} \mu_i = 1$, is not present. Densities from the family (16.17) with values of $k_i$ 
set between 0 and 1 are commonly used for noninformative prior specifications. 


Nonconjugate distributions. Nonconjugate prior distributions arise, for example, from hi- 
erarchical models in which parameters corresponding to high-order interactions are treated 
as exchangeable. Unfortunately, such models are not amenable to the special computational 
methods for loglinear models described below. 


Computation 


Finding the mode. As a first step to computing posterior inferences, we always recommend 
obtaining initial estimates of $\beta$ using some simple procedure. In loglinear models, these 
crude estimates will often be obtained using standard loglinear model software with the 
original data supplemented by the ‘prior cell counts’ $k_i$ if a conjugate prior distribution is 
used as in (16.17). For some special loglinear models, the maximum likelihood estimate, 
and hence the expected counts, can be obtained in closed form. For the saturated model, 
the expected counts equal the observed counts, and for the null model, the expected counts 
are all equal to $\sum_{i=1}^{n} y_i/n$. In the independence model, the estimates for the loglinear model 
parameters $\beta$ are obtained directly from marginal totals in the contingency table. 
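
For a two-way table, the closed-form fit under independence can be sketched as follows. This is a minimal Python illustration with made-up counts; the function name is an assumption, and no prior cell counts are added, so the result is the maximum likelihood fit. Each expected count is the product of its row and column totals divided by the grand total.

```python
def independence_fit(table):
    """Expected counts under independence for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

observed = [[10, 20],
            [30, 40]]                 # made-up counts
fitted = independence_fit(observed)   # [[12.0, 18.0], [28.0, 42.0]]
```

The fitted table reproduces the observed row and column totals, which are the sufficient statistics for this model.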

For more complicated models, however, the posterior modes cannot generally be obtained 
in closed form. In these cases, an iterative approach, iterative proportional fitting 
(IPF), can be used to obtain the estimates. In IPF, an initial estimate of the vector of 
expected cell counts, $\mu$, chosen to fit the model, is iteratively refined by multiplication by 
appropriate scale factors. For most problems, a convenient starting point is $\beta = 0$, so that 
$\mu = 1$ for all cells (assuming the loglinear model contains a constant term). 

At each step of IPF, the table counts are adjusted to match the model’s sufficient statistics 
(marginal tables). The iterative proportional fitting algorithm is generally expressed in 
terms of $\gamma$, the factors in the multiplicative model, which are exponentials of the loglinear 
model coefficients: 

$$\gamma_j = \exp(\beta_j).$$

The prior distribution is assumed to be the conjugate Dirichlet-like distribution (16.17). 
Let 

$$y_{j+} = \sum_{i=1}^{n} x_{ij} (y_i + k_i)$$

represent the margin of the table corresponding to the $j$th column of $X$. At each step of 
the IPF algorithm, a single parameter is altered. The basic step, updating $\gamma_j$, assigns 

$$\gamma_j^{\rm new} = \frac{y_{j+}}{\sum_{i=1}^{n} x_{ij}\mu_i^{\rm old}}\, \gamma_j^{\rm old}.$$

Then the expected cell counts are modified accordingly, 

$$\mu_i^{\rm new} = \mu_i^{\rm old} \left(\frac{\gamma_j^{\rm new}}{\gamma_j^{\rm old}}\right), \qquad (16.18)$$



rescaling the expected count in each cell $i$ for which $x_{ij} = 1$. These two steps are repeated 
indefinitely, cycling through all of the parameters $j$. The resulting series of tables converges 
to the mode of the posterior distribution of $\mu$ given the data and prior information. Cells 
with values equal to zero that are assumed to have occurred by chance (random zeros as 
opposed to structural zeros) are generally not a problem unless a saturated model is fitted 
or all of the cells needed to estimate a model parameter are zero. The iteration is continued 
until the expected counts do not change appreciably. 
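
The IPF steps described above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions rather than production code: the function name, the stopping tolerance, and the example table are made up, the incidence matrix is passed as a list of 0/1 rows, and the prior cell counts default to zero, in which case the margins are just those of the data.

```python
def ipf(X, y, k=None, tol=1e-10, max_sweeps=1000):
    """Iterative proportional fitting for a loglinear model with 0/1 matrix X.

    Returns the fitted expected counts mu and the multiplicative factors gamma.
    """
    n, p = len(X), len(X[0])
    k = [0.0] * n if k is None else k
    gamma = [1.0] * p   # starting point beta = 0
    mu = [1.0] * n      # so all expected counts start at 1
    # Margins y_{j+} = sum_i x_ij (y_i + k_i), one per column of X.
    margins = [sum(X[i][j] * (y[i] + k[i]) for i in range(n)) for j in range(p)]
    for _ in range(max_sweeps):
        previous = mu[:]
        for j in range(p):
            fitted = sum(X[i][j] * mu[i] for i in range(n))
            ratio = margins[j] / fitted
            gamma[j] *= ratio            # update of gamma_j
            for i in range(n):
                if X[i][j] == 1:
                    mu[i] *= ratio       # rescaling step (16.18)
        if max(abs(a - b) for a, b in zip(mu, previous)) < tol:
            break
    return mu, gamma

# Independence model for a made-up 2 x 2 table: constant, row, column indicators.
X = [[1, r, c] for r in (0, 1) for c in (0, 1)]
y = [10, 20, 30, 40]
mu_hat, gamma_hat = ipf(X, y)
```

For this table the iteration should approach the closed-form independence fit (12, 18, 28, 42), and each fitted count is the product of the $\gamma_j$ for the columns in which that cell has a 1.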
Bayesian IPF. Bayesian computations for loglinear models apply a stochastic version of 
the IPF algorithm, a Gibbs sampler applied to the vector $\gamma$. As in IPF, an initial estimate 
is required, the simplest choice being $\gamma_j = 1$ for all $j$, which corresponds to all expected cell 
counts equal to 1 (recall that the expected cell counts are just products of the $\gamma$’s). The 
only danger in choosing the initial estimates, as with iterative proportional fitting, is that 
the initial choices cannot incorporate structure corresponding to interactions that are not 
in the model. The simple choice recommended above is always safe. At each step of the 
Bayesian IPF algorithm, a single parameter is altered. To update $\gamma_j$, we assign 

$$\gamma_j^{\rm new} = \frac{A}{2y_{j+}}\, \frac{y_{j+}}{\sum_{i=1}^{n} x_{ij}\mu_i^{\rm old}}\, \gamma_j^{\rm old},$$

where $A$ is a random draw from a $\chi^2$ distribution with $2y_{j+}$ degrees of freedom. This step 
is identical to the usual IPF step except for the random first factor. Then the expected 
cell counts are modified using the formula (16.18). The two steps are repeated indefinitely, 
cycling through all of the parameters $j$. The resulting series of tables converges to a series 
of draws from the posterior distribution of $\mu$ given the data and prior information. 
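
A stochastic version of the same loop gives a sketch of Bayesian IPF. As before this is an illustration under assumptions: the function name, burn-in length, seed, example table, and the choice $k_i = 0.5$ are all made up for the example. The $\chi^2$ draw with $2y_{j+}$ degrees of freedom is generated as a gamma variate with shape $y_{j+}$ and scale 2, which is the same distribution.

```python
import random

def bayesian_ipf(X, y, k, n_draws, burn_in=200, seed=1):
    """Bayesian IPF for Poisson sampling; returns a list of draws of mu."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    mu = [1.0] * n  # simplest starting point: all gamma_j = 1
    margins = [sum(X[i][j] * (y[i] + k[i]) for i in range(n)) for j in range(p)]
    draws = []
    for sweep in range(burn_in + n_draws):
        for j in range(p):
            fitted = sum(X[i][j] * mu[i] for i in range(n))
            # A ~ chi^2 with 2*y_{j+} degrees of freedom = Gamma(shape y_{j+}, scale 2).
            A = rng.gammavariate(margins[j], 2.0)
            ratio = (A / (2.0 * margins[j])) * (margins[j] / fitted)
            for i in range(n):
                if X[i][j] == 1:
                    mu[i] *= ratio   # same rescaling as (16.18)
        if sweep >= burn_in:
            draws.append(mu[:])
    return draws

# Independence model for a made-up 2 x 2 table, with prior cell counts of 0.5.
X = [[1, r, c] for r in (0, 1) for c in (0, 1)]
y = [10, 20, 30, 40]
k = [0.5] * 4
draws = bayesian_ipf(X, y, k, n_draws=2000)
mean_total = sum(sum(d) for d in draws) / len(draws)
```

Every draw stays on the model surface (here the cross-product ratio of the four cells remains 1), because each step only rescales the cells sharing a column of $X$. For multinomial data, one would additionally rescale each draw to the fixed total, as described in the next paragraph.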

Bayesian IPF can be modified for non-Poisson sampling schemes. For example, for 
multinomial data, after each step of the Bayesian IPF, when expected cell counts are modified, 
we just rescale the vector $\mu$ to have the correct total. This amounts to using the $\beta_j$ 
(or equivalently the $\gamma_j$) corresponding to the intercept of the loglinear model to impose the 
multinomial constraint. In fact, we need not sample this $\gamma_j$ during the algorithm because it 
is used at each step to satisfy the multinomial constraint. Similarly, a product multinomial 
sample would have several parameters determined by the fixed marginal totals. 

The computational approach presented here can be used only for models of categorical 
measurements. If there are both categorical and continuous measurements, then more 
appropriate analyses are obtained using the normal linear model (Chapter 14) if the outcome 
is continuous, or a generalized linear model from earlier in this chapter if the outcome is 
categorical. 


16.8 Bibliographic note 


The term ‘generalized linear model’ was coined by Nelder and Wedderburn (1972), who 
modified Fisher’s scoring algorithm for maximum likelihood estimation. An excellent (non- 
Bayesian) reference is McCullagh and Nelder (1989). Hinde (1982) and Liang and McCul- 
lagh (1993) discuss models of overdispersion in generalized linear models and examine how 
they fit actual data. Gelman and Hill (2007) provide a computationally and graphically 
focused introduction to generalized linear models. 

Albert and Chib (1995) and Gelman, Goegebeur, et al. (2000) discuss Bayesian residual 
analysis and other model checks for discrete-data regressions; see also Landwehr, Pregibon, 
and Shoemaker (1984). 

Gelman, Jakulin, et al. (2008) discuss the weakly informative Cauchy prior distribution 
for logistic regression. 

Knuiman and Speed (1988) and Albert (1988) present Bayesian analyses of contingency 
tables based on analytic approximations. Bedrick, Christensen, and Johnson (1996) discuss 
prior distributions for generalized linear models. 

Dempster, Selwyn, and Weeks (1983) is an early example of fully Bayesian inference 
for logistic regression (using a normal approximation corrected by importance sampling 
to compute the posterior distribution for the hyperparameters). Zeger and Karim (1991), 
Karim and Zeger (1992), and Albert (1992) use Gibbs sampling to incorporate varying 
coefficients in generalized linear models. Dellaportas and Smith (1993) describe Gibbs 
sampling using the rejection method of Gilks and Wild (1992) to sample each component 
of $\beta$ conditional on the others; they show that this approach works well if the canonical 
link function is used. Albert and Chib (1993) perform Gibbs sampling for binary and 
polychotomous response data by introducing continuous latent scores. Gelfand and Sahu 
(1999) discuss Gibbs sampling for generalized linear models. Clayton and Bernardinelli 
(1992) consider hierarchical generalized linear models for disease mapping. Datta (1999) 
fit a hierarchical model to estimate state-level unemployment rates. 

Ormerod and Wand (2012) discuss variational Bayes computation for hierarchical gen- 
eralized linear models, and Cseke and Heskes (2011) discuss expectation propagation. 

The weakly informative prior distribution of Section 16.3 comes from Gelman, Jakulin, 
et al. (2008). Related work on separation and prior distributions in logistic regression 
includes Firth (1993), Raftery (1996b), Heinze and Schemper (2003), and Zorn (2005). 

The police stop-and-frisk study is described in Spitzer (1999) and Gelman, Fagan, and 
Kiss (2007). The pre-election polling example comes from Park, Gelman, and Bafumi 
(2004); see also Gelman and Little (1997) and Lax and Phillips (2009a,b) and, for related 
work, Gelman, Shor, et al. (2007). Gelman and Ghitza (2013) discuss the challenges of 
multilevel regression and poststratification for studying patterns in voting. Reilly, Gelman, 
and Katz present a time series model for poststratification on variables that are not observed 
in the population. 

Belin et al. (1993) fit hierarchical logistic regression models to missing data in a census 
adjustment problem, performing approximate Bayesian computations using the ECM algo- 
rithm. This article also includes extensive discussion of the choices involved in setting up the 
model and the sensitivity to assumptions. Imai and van Dyk (2005) discuss Bayesian com- 
putation for unordered multinomial probit models. The basic paired comparisons model is 
due to Bradley and Terry (1952); the extension for ties and order effects is due to Davidson 
and Beaver (1977). Other references on models for paired comparisons include Stern (1990) 
and David (1988). The World Cup chess data are analyzed by Glickman (1993). John- 
son (1996) presents a Bayesian analysis of categorical data that was ordered by multiple 
raters, and Bradlow and Fader (2001) use hierarchical Bayesian methods to model parallel 
time series of rank data. Jackman (2001) and Martin and Quinn (2002) apply hierarchical 
Bayesian models to estimate ideal points from political data. Johnson (1997) presents a 
detailed example of a hierarchical discrete-data regression model for university grading; this 
article is accompanied by several interesting discussions. 

Books on loglinear models include Fienberg (1977) and Agresti (2002). Goodman (1991) 
provides a review of models and methods for contingency table analysis. See Fienberg 
(2000) for a recent overview. Good (1965) discusses a variety of Bayesian models, including 
hierarchical models, for contingency tables based on the multinomial distribution. 

Iterative proportional fitting was first presented by Deming and Stephan (1940). The 
Bayesian iterative proportional fitting algorithm was proposed by Gelman and Rubin (1991); 
a related algorithm using ECM to find the mode for a loglinear model with missing data 
appears in Meng and Rubin (1993). Dobra, Tebaldi, and West (2003) present recent work 
on Bayesian inference for contingency tables. 


16.9 Exercises 


1. Normal approximation for generalized linear models: derive equations (16.4). 
2. Computation for a simple generalized linear model: 


(a) Express the bioassay example of Section 3.7 as a generalized linear model and obtain 
posterior simulations using the computational techniques presented in Section 16.2. 

(b) Fit a probit regression model instead of the logit (you should be able to use essentially 
the same steps after altering the likelihood appropriately). Discuss any changes in the 
posterior inferences. 
3. Overdispersed models: 
(a) Express the bioassay example of Section 3.7 as a generalized linear model, but replacing 
(3.14) by 

$$\mbox{logit}(\theta_i) \sim {\rm N}(\alpha + \beta x_i, \sigma^2),$$

so that the logistic regression holds approximately but not exactly. Set up a noninformative 
prior distribution and obtain posterior simulations of $(\alpha, \beta, \sigma)$ under this 
model. Discuss the effect that this model expansion has on scientific inferences for 
this problem. 

(b) Repeat (a) with the following hypothetical data: n = (5000, 5000, 5000, 5000), y = 
(500, 1000, 3000, 4500), and x unchanged from the first column of Table 3.1. 

4. Computation for a hierarchical generalized linear model: 

(a) Express the rat tumor example of Section 5.1 as a generalized linear model and obtain 
posterior simulations using the computational techniques presented in Section 16.2. 

(b) Use the posterior simulations to check the fit of the model. 

5. Poisson model with overdispersion: Find data on counts of some event given some pre- 
dictor. 

(a) Fit a standard Poisson regression model relating the log of the expected count linearly 
to the predictor. 

(b) Perform some model checking on the simple model proposed in (a), and see if there is 
evidence of overdispersion. 

(c) Fit a hierarchical model assuming independent normally distributed errors. 

(d) Is there evidence that this model provides a better fit to the data? 

(e) Experiment with other forms of hierarchical model, in particular a mixture model 
that assumes a discrete prior distribution on two or three points for the errors, and 
perhaps also a t prior distribution. Explore the fit of the various models to the data 
and examine the sensitivity of inferences to the assumptions. 


See Hinde (1982) for background on these models. 


6. Fake-data simulation: Consider the following discrete-data model: $y_i \sim \mbox{Poisson}(e^{X_i\beta})$, $i = 
1,\ldots,n$, with independent Cauchy prior distributions with location 0 and scale 2.5 on 
the elements of $\beta$. 


(a) Write a program in R to apply the Metropolis algorithm for $\beta$ given data $X, y$. Your 
program should work with any number of predictors (that is, X can be any matrix 
with the same number of rows as the length of y). 

(b) Simulate fake data from the model for a case with 50 data points and 3 predictors and 
run your program. Plot the posterior simulations from multiple chains and monitor 
convergence. 

(c) Now suppose you want to allow for overdispersion. In a sentence or two, explain why 
it typically makes sense to fit an overdispersed model in this setting. 

(d) Write a new model using the negative binomial distribution that is a sensible extension 
of the above Poisson model. Be careful about parameters and transformations! Write 
the model rigorously in statistical notation, and write an R function to compute the 
log posterior density. (We are not, however, asking you to program the Metropolis 
algorithm for this model or fit it to data.) 

7. Paired comparisons: consider the subset of the chess data in Table 16.3. 


(a) Perform a simple exploratory analysis of the data to estimate the relative abilities of 
the players. 

(b) Using some relatively simple (but reasonable) model, estimate the probability that 
player $i$ wins if he plays White against player $j$, for each pair of players, $(i, j)$. 
(c) Fit the model described in Section 16.6 and use it to estimate these probabilities. 
8. Iterative proportional fitting: 
(a) Prove that the IPF algorithm increases the posterior density for $\gamma$ at each step. 
(b) Prove that the Bayesian IPF algorithm is in fact a Gibbs sampler for the parameters 
$\gamma_j$. 
9. Improper prior distributions and proper posterior distributions: consider the hierarchical 
model for the meta-analysis example in Section 16.6. 


(a) Show that, for any value of p12, the posterior distribution of all the remaining param- 
eters is proper, conditional on p12. 


(b) Show that the posterior distribution of all the parameters, including p12, is proper. 


10. Variational Bayes for probit regression: Set up and program variational Bayes for a 
probit regression with two coefficients (that is, $\Pr(y_i = 1) = \Phi(a + b x_i)$, for $i = 1,\ldots,n$), 
using the latent-data formulation (so that $z_i \sim {\rm N}(a + b x_i, 1)$ and $y_i = 1$ if $z_i > 0$ and 0 
otherwise): 

(a) Write the log posterior density (up to an arbitrary constant), $p(a, b, z | y)$. 

(b) Assuming a variational approximation $g$ that is independent in its $n + 2$ dimensions, 
determine the functional form of each of the factors in $g$. 

(c) Write the steps of the variational Bayes algorithm and program them in R. 


11. HMC for probit regression: Millions of people in rural Bangladesh are exposed to dan- 
gerous levels of arsenic in their drinking water which they get from home wells. Several 
years ago a survey was conducted in a small area of Bangladesh to see if people with 
high arsenic levels would be willing to switch to a neighbor’s well. File wells.dat has 
the data; all you need here are the variables switch (1 if the respondent said he or she 
would switch, 0 otherwise) and arsenic (the concentration in the respondent’s home 
well, with anything over 0.5 considered dangerous). Apply the probit model described 
in Exercise 16.10 to predict the probability of switching given arsenic level. The goal is 
inference for the coefficients a and b. 


(a) Program Hamiltonian Monte Carlo for the probit model, again using the latent- 
variable formulation (so you are jumping in a space of n+2 dimensions). Tune the 
algorithm and run to approximate convergence. 


Or, if you want less of a challenge, you can instead take the (relatively) easy way out 
and program Metropolis for this model and data. 


(b) Check your results by running Stan. 
(c) Compare to the variational Bayes inferences for a and b from Exercise 16.10. 
Chapter 17 


Models for robust inference 


So far, we have relied primarily upon the normal, binomial, and Poisson distributions, and 
hierarchical combinations of these, for modeling data and parameters. The use of a limited 
class of distributions results, however, in a limited and potentially inappropriate class of 
inferences. Many problems fall outside the range of convenient models, and models should 
be chosen to fit the underlying science and data, not simply for their analytical or com- 
putational convenience. As illustrated in Chapter 5, often the most useful approach for 
creating realistic models is to work hierarchically, combining simple univariate models. If, 
for convenience, we use simplistic models, it is important to answer the following question: 
in what ways does the posterior inference depend on extreme data points and on unassess- 
able model assumptions? We have already discussed, in Chapter 6, the latter part of this 
question, which is essentially the subject of sensitivity analysis; here we return to the topic 
in greater detail, using more advanced computational methods. 


17.1 Aspects of robustness 
Robustness of inferences to outliers 


Models based on the normal distribution are notoriously ‘nonrobust’ to outliers, in the sense 
that a single aberrant data point can strongly affect the inference for all the parameters in 
the model, even those with little substantive connection to the outlying observation. 

For example, in the educational testing example of Section 5.5, our estimates for the 
eight treatment effects were obtained by shifting the individual school means toward the 
grand mean (or, in other words, shifting toward the prior information that the true effects 
came from a common normal distribution), with the proportionate shifting for each school 
$j$ determined only by its sampling error, $\sigma_j$, and the variation $\tau$ between school effects. 
Suppose that the observation for the eighth school in the study, $y_8$ in Table 5.2 on page 
120, had been 100 instead of 12, so that the eight observations were 28, 8, $-3$, 7, $-1$, 1, 18, 
and 100, with the same standard errors as reported in Table 5.2. If we were to apply the 
hierarchical normal model to this dataset, our posterior distribution would tell us that $\tau$ 
has a high value, and thus each estimate $\hat{\theta}_j$ would be essentially equal to its observed effect 
$y_j$; see equation (5.17) and Figure 5.6. But does this make sense in practice? After all, 
given these hypothetical observations, the eighth school would seem to have an extremely 
effective SAT coaching program, or maybe the 100 is just the result of a data recording 
error. In either case, it would not seem right for the single observation $y_8$ to have such a 
strong influence on how we estimate $\theta_1,\ldots,\theta_7$. 

In the Bayesian framework, we can reduce the influence of the aberrant eighth observation 
by replacing the normal population model for the $\theta_j$’s by a longer-tailed family 
of distributions, which allows for the possibility of extreme observations. By long-tailed, 
we mean a distribution with relatively high probability content far away from the center, 
where the scale of ‘far away’ is determined, for example, relative to the diameter of a region 
containing 50% of the probability in the distribution. Examples of long-tailed distributions 
include the family of $t$ distributions, of which the most extreme case is the Cauchy or $t_1$, and 
also (finite) mixture models, which generally use a simple distribution such as the normal 
for the bulk of values but allow a discrete probability of observations or parameter values 
from an alternative distribution that can have a different center and generally has a much 
larger spread. In the hypothetical modification of the SAT coaching example, performing 
an analysis using a long-tailed distribution for the $\theta_j$’s would result in the observation 100 
being interpreted as arising from an extreme draw from the long-tailed distribution rather 
than as evidence that the normal distribution of effects has a high variance. The resulting 
analysis would shrink the eighth observation somewhat toward the others, but not nearly 
as much (relative to its distance from the overall mean) as the first seven are shrunk toward 
each other. (Given this hypothetical dataset, the posterior probability $\Pr(\theta_8 > 100 | y)$ should 
presumably be somewhat less than 0.5, and this justifies some shrinkage.) 


As our hypothetical example indicates, we do not have to abandon Bayesian principles 
to handle outliers. For example, a long-tailed model such as a Cauchy distribution or even 
a two-component mixture (see Exercise 17.1) is still an exchangeable prior distribution for 
$(\theta_1,\ldots,\theta_8)$, as is appropriate when there is no prior information distinguishing among the 
eight schools. The choice of exchangeable prior model affects the manner in which the 
estimates of the $\theta_j$’s are shrunk, and we can thereby reduce the effect of an outlying observation 
without having to treat it in a fundamentally different way in the analysis. (This 
should not replace careful examination of the data and checking for possible recording er- 
rors in outlying values.) A distinction is sometimes made between methods that search 
for outliers—possibly to remove them from the analysis—and robust procedures that are 
invulnerable to outliers. In the Bayesian framework, the two approaches should not be dis- 
tinguished. For instance, using mixture models (either finite mixture models as in Chapter 
22 or overdispersed versions of standard models) not only results in categorizing extreme 
observations as arising from high-variance mixture components (rather than simply sur- 
prising ‘outliers’) but also implies that these points have less influence on inferences for 
estimands such as population means and medians. 


Sensitivity analysis 


In addition to compensating for outliers, robust models can be used to assess the sensitivity 
of posterior inferences to model assumptions. For example, one can use a robust model that 
applies the $t$ in place of a normal distribution to assess sensitivity to the normal assumption 
by varying the degrees of freedom from large to small. As discussed in Chapter 6, the basic 
idea of sensitivity analysis is to try a variety of different distributions (for likelihood and 
prior models) and see how posterior inferences vary for estimands and predictive quantities 
of interest. Once samples have already been drawn from the posterior distribution under 
one model, it is often straightforward to draw from alternative models using importance 
resampling with enough accuracy to detect major differences in inferences between the 
models (see Section 17.3). If the posterior distribution of estimands of interest is highly 
sensitive to the model assumptions, iterative simulation methods might be required for more 
accurate computation. 
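
The idea of reusing posterior draws from one model to approximate inference under another can be sketched as follows. This is a simplified illustration in Python, not the procedure of Section 17.3 itself: the data are made up, the scale is assumed known and equal to 1, the prior on the center is flat, the alternative model is a $t$ with 4 degrees of freedom, and all function names are assumptions. Draws from the normal-model posterior are reweighted by the ratio of the $t$ likelihood to the normal likelihood and then resampled without replacement.

```python
import math
import random

def log_t_density(x, nu):
    """Log density of the standard t distribution with nu degrees of freedom."""
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi)
            - (nu + 1) / 2 * math.log(1 + x * x / nu))

def log_normal_density(x):
    """Log density of the standard normal distribution."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * x * x

def importance_resample(draws, log_weights, size, rng):
    """Pick `size` draws without replacement, with probability proportional to weight."""
    shift = max(log_weights)  # subtract the maximum to avoid overflow in exp
    pool = [(d, math.exp(lw - shift)) for d, lw in zip(draws, log_weights)]
    chosen = []
    for _ in range(size):
        u = rng.random() * sum(w for _, w in pool)
        acc = 0.0
        for idx, (_, w) in enumerate(pool):
            acc += w
            if u <= acc:
                break
        chosen.append(pool.pop(idx)[0])
    return chosen

rng = random.Random(0)
y = [-1.0, 0.0, 1.0, 0.5, -0.5, 4.0]  # made-up data with one outlying value
n = len(y)
ybar = sum(y) / n

# Posterior draws of the center under the normal model (scale 1, flat prior).
normal_draws = [rng.gauss(ybar, 1 / math.sqrt(n)) for _ in range(4000)]

# Log importance ratio at each draw: t_4 likelihood over normal likelihood.
log_w = [sum(log_t_density(yi - m, 4) - log_normal_density(yi - m) for yi in y)
         for m in normal_draws]

t_draws = importance_resample(normal_draws, log_w, 400, rng)
mean_normal = sum(normal_draws) / len(normal_draws)
mean_t = sum(t_draws) / len(t_draws)
```

In this example the resampled draws concentrate at lower values than the normal-model draws, because the $t$ model gives the outlying point less influence. When the two posterior distributions overlap poorly, a few draws dominate the weights and the approximation becomes unreliable, which is the situation in which iterative simulation is the safer choice.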


In a sense, much of the analysis of the SAT coaching experiments in Section 5.5, especially 
Figures 5.6 and 5.7, is a sensitivity analysis, in which the parameter $\tau$ is allowed to 
vary from 0 to $\infty$. As discussed in Section 5.5, the observed data are actually consistent 
with the model of all equal effects (that is, $\tau = 0$), but that model makes no substantive 
sense, so we fit the model allowing $\tau$ to be any positive value. The result is summarized in 
the marginal posterior distribution for $\tau$ (shown in Figure 5.5), which describes a range of 
values of $\tau$ that are supported by the data. 


This electronic edition is for non-commercial purposes only. 


17.2 Overdispersed versions of standard models 


Sometimes it will appear natural to use one of the standard models—binomial, normal, 
Poisson, exponential—except that the data are too dispersed. For example, the normal 
distribution should not be used to fit a large sample in which 10% of the points lie a distance 
more than 1.5 times the interquartile range away from the median. In the hypothetical 
example of the previous section we suggested that the prior or population model for the $\theta_j$’s 
should have longer tails than the normal. For each of the standard models, there is in fact 
a natural extension in which a single parameter is added to allow for overdispersion. Each 
of the extended models has an interpretation as a mixture distribution. 
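
To see why 10% of points lying more than 1.5 interquartile ranges from the median is too many for a normal model, one can compute the corresponding tail fraction directly. The short Python sketch below is an illustration only (the function names are assumptions); it compares the normal distribution with the Cauchy, whose quartiles are at $\pm 1$ in standard form.

```python
import math
from statistics import NormalDist

def normal_tail_fraction(multiple=1.5):
    """Probability that a normal draw lies more than multiple*IQR from its median."""
    std_normal = NormalDist()
    iqr = std_normal.inv_cdf(0.75) - std_normal.inv_cdf(0.25)
    cutoff = multiple * iqr
    return 2 * (1 - std_normal.cdf(cutoff))

def cauchy_tail_fraction(multiple=1.5):
    """Same probability for the Cauchy distribution, using its closed-form cdf."""
    iqr = 2.0  # quartiles of the standard Cauchy are -1 and +1
    cutoff = multiple * iqr
    return 1 - 2 * math.atan(cutoff) / math.pi

p_normal = normal_tail_fraction()   # roughly 0.04
p_cauchy = cauchy_tail_fraction()   # roughly 0.20
```

Under the normal distribution only about 4% of values fall that far from the median, so a large sample with 10% beyond this distance calls for a longer-tailed model; the Cauchy, at the other extreme, puts about 20% there.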

A feature of all these distributions is that they can never be underdispersed. This makes 
sense in light of formulas (2.7) and (2.8) and the mixture interpretations: the mean of the 
generalized distribution is equal to that of the underlying family, but the variance is higher. 
If the data are believed to be underdispersed relative to the standard distribution, different 
models should be used. 
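
To spell out the argument, write the extended model as a mixture in which $y$ follows the standard distribution given a latent quantity, denoted here by $\phi$ (a symbol introduced only for this illustration), which itself varies across observations. Assuming the relevant moments exist, the identities for conditional means and variances give

$$E(y) = E(E(y|\phi)), \qquad \mbox{var}(y) = E(\mbox{var}(y|\phi)) + \mbox{var}(E(y|\phi)) \geq E(\mbox{var}(y|\phi)),$$

so the variance of the mixture can never fall below the average variance of its components. For instance, in a scale mixture of normals with common mean, $\mbox{var}(E(y|\phi)) = 0$ and the marginal variance is simply the average of the component variances.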


The t distribution in place of the normal 


The $t$ distribution has a longer tail than the normal and can be used for accommodating (1) 
occasional unusual observations in a data distribution or (2) occasional extreme parameters 
in a prior distribution or hierarchical model. The $t$ family of distributions, $t_\nu(\mu, \sigma^2)$, is 
characterized by three parameters: center $\mu$, scale $\sigma$, and a ‘degrees of freedom’ parameter 
$\nu$ that determines the shape of the distribution. The $t$ densities are symmetric, and $\nu$ must 
fall in the range $(0, \infty)$. At $\nu = 1$, the $t$ is equivalent to the Cauchy distribution, which is so 
long-tailed it has infinite mean and variance, and as $\nu \to \infty$, the $t$ approaches the normal 
distribution. If the t distribution is part of a probability model attempting accurately to fit 
a long-tailed distribution, based on a reasonably large quantity of data, then it is generally 
appropriate to include the degrees of freedom as an unknown parameter. In applications for 
which the t is chosen simply as a robust alternative to the normal, the degrees of freedom 
can be fixed at a small value to allow for outliers, but no smaller than prior understanding 
dictates. For example, t’s with one or two degrees of freedom have infinite variance and are 
not usually realistic in the far tails. 


Mixture interpretation. Recall from Sections 3.2 and 12.1 that the t_ν(μ, σ²) distribution 
can be interpreted as a mixture of normal distributions with a common mean and variances 
distributed as scaled inverse-χ². For example, the model y_i ~ t_ν(μ, σ²) is equivalent to 


    y_i \mid V_i \sim \mathrm{N}(\mu, V_i)
    V_i \sim \text{Inv-}\chi^2(\nu, \sigma^2),        (17.1)


an expression we have already introduced as (12.1) on page 294 to illustrate the computational 
methods of auxiliary variables and parameter expansion. Statistically, the observations 
with high variance can be considered the outliers in the distribution. A similar 
interpretation holds when modeling exchangeable parameters θ_j. 
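
As a numerical check of the mixture interpretation (17.1), the following Python sketch draws from a t distribution in two stages and compares the sample variance with the closed-form value σ²ν/(ν − 2). The function names, the seed, and the values ν = 5, μ = 0, σ = 1 are illustrative assumptions, not taken from the text.

```python
import random

def draw_scaled_inv_chi2(nu, s2, rng):
    # If X ~ chi^2_nu then nu * s2 / X ~ Inv-chi^2(nu, s2); chi^2_nu is Gamma(nu/2, scale=2).
    return nu * s2 / rng.gammavariate(nu / 2.0, 2.0)

def draw_t_by_mixture(nu, mu, sigma, rng):
    """Return (y, V): first V ~ Inv-chi^2(nu, sigma^2), then y | V ~ N(mu, V)."""
    v = draw_scaled_inv_chi2(nu, sigma ** 2, rng)
    return rng.gauss(mu, v ** 0.5), v

rng = random.Random(1)
nu, mu, sigma = 5.0, 0.0, 1.0
draws = [draw_t_by_mixture(nu, mu, sigma, rng) for _ in range(100_000)]

sample_var = sum((y - mu) ** 2 for y, _ in draws) / len(draws)
theory_var = sigma ** 2 * nu / (nu - 2.0)      # variance of t_nu(mu, sigma^2), valid for nu > 2
print(round(sample_var, 2), round(theory_var, 2))
```

Keeping the auxiliary variance V alongside each draw makes the outlier interpretation concrete: the draws of y that land far from μ are, for the most part, those paired with a large V.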


Negative binomial alternative to Poisson 


A common difficulty in applying the Poisson model to data is that the Poisson model requires 
that the variance equal the mean; in practice, distributions of counts often are overdispersed, 
with variance greater than the mean. We have already discussed overdispersion in the 
context of generalized linear models (see Section 16.1), and Section 16.4 gives an example 
of a hierarchical normal model for overdispersed Poisson regression. 

Another way to model overdispersed count data is using the negative binomial distribution, 
a two-parameter family that allows the mean and variance to be fitted separately, 
with variance at least as great as the mean. Data y_1,…,y_n that follow a Neg-bin(α, β) 
distribution can be thought of as Poisson observations with means λ_1,…,λ_n, which follow 
a Gamma(α, β) distribution. The variance of the negative binomial distribution is (α/β)((β+1)/β), 
which is always greater than the mean, α/β, in contrast to the Poisson, whose variance is 
always equal to its mean. In the limit as β → ∞ with α/β remaining constant, the underlying 
gamma distribution approaches a spike, and the negative binomial distribution approaches 
the Poisson. 
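
The mean and variance quoted above, and the Poisson limit, can be checked numerically. The sketch below sums the negative binomial probabilities directly; it assumes the Neg-bin(α, β) density that results from mixing Poisson(λ) over λ ~ Gamma(α, β) with β a rate parameter, and the function names and the truncation point `y_max` are ours.

```python
from math import exp, lgamma, log

def neg_bin_log_pmf(y, alpha, beta):
    """log Neg-bin(y | alpha, beta), the Poisson-gamma mixture with gamma rate beta."""
    return (lgamma(y + alpha) - lgamma(alpha) - lgamma(y + 1)
            + alpha * log(beta / (beta + 1.0)) - y * log(beta + 1.0))

def neg_bin_moments(alpha, beta, y_max=2000):
    """Mean and variance by direct summation of the pmf, truncated at y_max."""
    probs = [exp(neg_bin_log_pmf(y, alpha, beta)) for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

# Hold the mean alpha/beta fixed at 3 and let beta grow: var/mean = (beta + 1)/beta tends to 1.
for beta in (0.5, 2.0, 50.0):
    mean, var = neg_bin_moments(3.0 * beta, beta)
    print(beta, round(mean, 3), round(var / mean, 3))
```

The ratio of variance to mean equals (β + 1)/β, so it is large when β is small and approaches the Poisson value of 1 as β increases.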


Beta-binomial alternative to binomial 


Similarly, the binomial model for discrete data has the practical limitation of having only 
one free parameter, which means the variance is determined by the mean. A standard robust 
alternative is the beta-binomial distribution, which, as the name suggests, is a beta mixture 
of binomials. The beta-binomial is used, for example, to model educational testing data, 
where a ‘success’ is a correct response, and individuals vary greatly in their probabilities 
of getting a correct response. Here, the data y_i (the number of correct responses for each 
individual i = 1,…,n) are modeled with a Beta-bin(m, α, β) distribution and are thought 
of as binomial observations with a common number of trials m and unequal probabilities 
π_1,…,π_n that follow a Beta(α, β) distribution. The variance of the beta-binomial with 
mean probability α/(α+β) is greater by a factor of (α+β+m)/(α+β+1) than that of the binomial with the 
same probability; see Table A.1 in Appendix A. When m = 1, no information is available 
to distinguish between the beta and binomial variation, and the two models have equal 
variances. 
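
The overdispersion factor can be read off the standard beta-binomial moments, stated here from the usual formulas rather than copied from Table A.1, writing p = α/(α+β) for the mean probability:

```latex
\mathrm{E}(y) = m\,p, \qquad
\mathrm{var}(y) = m\,\frac{\alpha\beta}{(\alpha+\beta)^2}\cdot\frac{\alpha+\beta+m}{\alpha+\beta+1}
               = \underbrace{m\,p\,(1-p)}_{\text{binomial variance}}
                 \cdot \frac{\alpha+\beta+m}{\alpha+\beta+1},
\qquad p = \frac{\alpha}{\alpha+\beta}.
```

The factor equals 1 at m = 1 and exceeds 1 for every m > 1; it also tends to 1 as α + β → ∞ with p held fixed, which is the limit in which the beta distribution concentrates at p and the binomial is recovered.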


The t distribution alternative to logistic and probit regression 


Logistic and probit regressions can be nonrobust in the sense that for large absolute values 
of the linear predictor Xβ, the inverse logit or probit transformations give probabilities close 
to 0 or 1. Such models could be made more robust by allowing the occasional misprediction 
for large values of Xβ. This form of robustness is defined not in terms of the data y (which 
equal 0 or 1 in binary regression) but with respect to the predictors X. A more robust 
model allows the discrete regression model to fit most of the data while occasionally making 
isolated errors. 

A robust model, robit regression, can be implemented using the latent-variable formulation 
of discrete-data regression models (see page 408), replacing the logistic or normal 
distribution of the latent continuous data u with the model u_i ~ t_ν((Xβ)_i, 1). In realistic 
settings it is impractical to estimate ν from the data: since the latent data u_i are never 
directly observed, it is essentially impossible to form inference about the shape of their 
continuous underlying distribution. Instead, ν is set at a low value to ensure robustness. Setting 
ν = 4 yields a distribution that is close to the logistic, and as ν → ∞, the model approaches 
the probit. Computation for the binary t regression can be performed using the EM algorithm 
and Gibbs sampler with the normal-mixture formulation (17.1) for the t distribution 
of the latent data u. In that approach, u_i and the variance of each u_i are treated as missing 
data. 


Why ever use a nonrobust model? 


The t family includes the normal as a special case, so why do we ever use the normal at all, or 
the binomial, Poisson, or other standard models? To start with, each of the standard models 
has a logical status that makes it plausible for many applied problems. The binomial and 
multinomial distributions apply to discrete counts for independent, identically distributed 
outcomes with a fixed total number of counts. The Poisson and exponential distributions fit 
the number of events and the waiting time for a Poisson process, which is a natural model 
for independent discrete events indexed by time. Finally, the central limit theorem tells us 
that the normal distribution is an appropriate model for data that are formed as the sum of 
a large number of independent components. In the educational testing example in Section 
5.5, each of the observed effects, y_j, is an average of adjusted test scores with n_j ≈ 60 (that 
is, the estimated treatment effect is based on about 60 students in school j). We can thus 
accurately approximate the sampling distribution of y_j by normality: y_j | θ_j, σ_j² ~ N(θ_j, σ_j²). 

Even when they are not naturally implied by the structure of a problem, the standard 
models are computationally convenient, since conjugate prior distributions often allow direct 
calculation of posterior means and variances and easy simulation. That is why it is easy 
to fit a normal population model to the θ_j’s in the educational testing example and why it 
is common to fit a normal model to the logarithm of all-positive data or the logit of data 
that are constrained to lie between 0 and 1. When a model is assigned in this more or less 
arbitrary manner, it is advisable to check the fit of the data using the posterior predictive 
distribution, as discussed in Chapter 6. But if we are worried that an assumed model is not 
robust, then it makes sense to perform a sensitivity analysis and see how much the posterior 
inference changes if we switch to a larger family of distributions, such as the t distributions 
in place of the normal. 


17.3 Posterior inference and computation 


As always, we can draw samples from the posterior distribution (or distributions, in the case 
of sensitivity analysis) using the methods described in Part III. In this section, we briefly 
describe the use of Gibbs sampling under the mixture formulation of a robust model. The 
approach is illustrated for a hierarchical normal-t model in Section 17.4. When expanding 
a model, however, we have the possibility of a less time-consuming approximation as an 
alternative: we can use the draws from the original posterior distribution as a starting point 
for simulations from the new models. In this section, we also describe two techniques that 
can be useful for robust models and sensitivity analysis: importance weighting for computing 
the marginal posterior density in a sensitivity analysis, and importance resampling (Section 
10.4) for approximating a robust analysis. 


Notation for robust model as expansion of a simpler model 


We use the notation p_0(θ|y) for the posterior distribution from the original model, which we 
assume has already been fitted to the data, and φ for the hyperparameter(s) characterizing 
the expanded model used for robustness or sensitivity analysis. Our goal is to sample from 


    p(\theta \mid \phi, y) \propto p(\theta \mid \phi)\, p(y \mid \theta, \phi),        (17.2)


using either a pre-specified value of φ (such as ν = 4 for a robust t model) or a range of 
values of φ. In the latter case, we also wish to compute the marginal posterior distribution 
of the sensitivity analysis parameter, p(φ|y). 


The robust family of distributions can enter the model (17.2) through the distribution 
of the parameters, p(θ|φ), or the data distribution, p(y|θ, φ). For example, Section 17.2 
focuses on robust data distributions, and our reanalysis of the SAT coaching experiments 
in Section 17.4 uses a robust distribution for model parameters. We must then set up a 
joint prior distribution, p(θ, φ), which can require some care because it captures the prior 
dependence between θ and φ. 


Gibbs sampling using the mixture formulation 


Markov chain simulation can be used to draw from the posterior distributions, p(θ|φ, y). 
This can be done using the mixture formulation, by sampling from the joint posterior 
distribution of θ and the extra unobserved scale parameters (V_i’s in the t model, λ_i’s in the 
negative binomial, and π_i’s in the beta-binomial). 

For a simple example, consider the t_ν(μ, σ²) distribution fitted to data y_1,…,y_n, with 
μ and σ unknown. Given ν, we have already discussed in Section 12.1 how to program the 
Gibbs sampler in terms of the parameterization (17.1) involving μ, σ², V_1,…,V_n. If ν is 
itself unknown, the Gibbs sampler must be expanded to include a step for sampling from 
the conditional posterior distribution of ν. No simple method exists for this step, but a 
Metropolis step can be used instead. Another complication is that such models commonly 
have multimodal posterior densities, with different modes corresponding to different 
observations in the tails of the t distributions, meaning that additional work is required to search 
for modes initially and jump between modes in the simulation, for example using simulated 
tempering (see Section 12.3). 
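
A minimal Python sketch of this Gibbs sampler for fixed ν follows. It assumes the noninformative prior p(μ, log σ) ∝ 1 and uses the conditional distributions that follow from (17.1) under that prior; the data vector, function name, and seed are hypothetical. It runs a single chain for illustration, whereas a real analysis would run several chains from overdispersed starting points and monitor convergence.

```python
import random

def gibbs_t_model(y, nu, n_iter, rng):
    """Gibbs sampler for y_i ~ t_nu(mu, sigma^2), nu fixed, assuming p(mu, log sigma) flat."""
    n = len(y)
    mu = sorted(y)[n // 2]                          # crude start: a middle order statistic
    sigma2 = sum((yi - mu) ** 2 for yi in y) / n
    draws = []
    for _ in range(n_iter):
        # V_i | mu, sigma2, y ~ Inv-chi^2(nu + 1, (nu*sigma2 + (y_i - mu)^2) / (nu + 1)),
        # drawn as (nu*sigma2 + (y_i - mu)^2) / chi^2_{nu+1}
        V = [(nu * sigma2 + (yi - mu) ** 2) / rng.gammavariate((nu + 1) / 2.0, 2.0)
             for yi in y]
        # mu | V, y ~ N(precision-weighted mean of y, 1 / sum_i 1/V_i)
        precision = sum(1.0 / v for v in V)
        mu_hat = sum(yi / v for yi, v in zip(y, V)) / precision
        mu = rng.gauss(mu_hat, precision ** -0.5)
        # sigma2 | V ~ Gamma(n*nu/2, rate = (nu/2) * sum_i 1/V_i); gammavariate takes a scale
        sigma2 = rng.gammavariate(n * nu / 2.0, 2.0 / (nu * precision))
        draws.append((mu, sigma2))
    return draws

y = [-1.2, 0.3, 0.8, -0.5, 1.1, 0.1, -0.7, 0.6, 0.2, 25.0]   # hypothetical data, one outlier
draws = gibbs_t_model(y, nu=4.0, n_iter=2000, rng=random.Random(0))
kept = draws[len(draws) // 2:]                                # discard the first half
post_mean_mu = sum(mu for mu, _ in kept) / len(kept)
sample_mean = sum(y) / len(y)
print(round(sample_mean, 2), round(post_mean_mu, 2))
```

With these data the sample mean is pulled toward the outlier, while the posterior mean of μ under the t model stays near the bulk of the observations, because the outlying point is assigned a large V_i and hence little weight.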


Sampling from the posterior predictive distribution for new data 


To perform sensitivity analysis and robust inference for predictions ỹ, follow the usual 
procedure of first drawing θ from the posterior distribution, p(θ|φ, y), and then drawing ỹ from 
the predictive distribution, p(ỹ|φ, θ). To simulate data from a mixture model, first draw the 
mixture indicators for each future observation, then draw ỹ, given the mixture parameters. 
For example, to draw ỹ from a t_ν(μ, σ²) distribution, first draw V ~ Inv-χ²(ν, σ²), then 
draw ỹ ~ N(μ, V). 


Computing the marginal posterior distribution of the hyperparameters by importance 
weighting 


During a check for model robustness or sensitivity to assumptions, we might like to avoid 
the additional programming effort required to apply Markov chain simulation to a robust 
model. If we have simulated draws from p_0(θ|y), then it is possible to obtain approximate 
inference under the robust model using importance weighting and importance resampling. 
We assume in the remainder of this section that simulation draws θ^s, s = 1,…,S, have 
already been obtained from p_0(θ|y). We can use importance weighting to evaluate the 
marginal posterior distribution, p(φ|y), using identity (13.11) on page 326, which in our 
current notation becomes 


    p(\phi \mid y) \propto p(\phi) \int \frac{p(\theta \mid \phi)\, p(y \mid \theta, \phi)}{p_0(\theta)\, p_0(y \mid \theta)}\; p_0(\theta)\, p_0(y \mid \theta)\, d\theta

                   \propto p(\phi) \int \frac{p(\theta \mid \phi)\, p(y \mid \theta, \phi)}{p_0(\theta)\, p_0(y \mid \theta)}\; p_0(\theta \mid y)\, d\theta.        (17.3)


In the first line above, the proportionality constant is 1/p(y), whereas in the second it is 
p_0(y)/p(y). For any φ, the value of p(φ|y), up to a constant factor, can be estimated by 
p(φ) times the average importance ratio for the simulations θ^s, 


    \frac{1}{S} \sum_{s=1}^{S} \frac{p(\theta^s \mid \phi)\, p(y \mid \theta^s, \phi)}{p_0(\theta^s)\, p_0(y \mid \theta^s)},

which can be evaluated, using a fixed set of S simulations, at each of a range of values of 
φ, and then graphed as a function of φ. 
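
The importance-weighting estimate can be sketched in a few lines of Python. Everything in the toy setup below is an assumption made for illustration: the data, the original model y_i ~ N(θ, 1) with a flat prior on θ (so that p_0(θ|y) is N(ȳ, 1/n) and can be sampled directly), the expanded model y_i ~ t_ν(θ, 1) with φ = ν, and a prior on ν that is flat over the grid. Ratios are averaged on the log scale to avoid numerical overflow.

```python
import random
from math import exp, lgamma, log, log1p, pi

def log_t_pdf(y, nu, mu, sigma):
    z = (y - mu) / sigma
    return (lgamma((nu + 1) / 2.0) - lgamma(nu / 2.0) - 0.5 * log(nu * pi) - log(sigma)
            - (nu + 1) / 2.0 * log1p(z * z / nu))

def log_normal_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return -0.5 * log(2.0 * pi) - log(sigma) - 0.5 * z * z

def log_mean_exp(values):
    m = max(values)                          # subtract the maximum for numerical stability
    return m + log(sum(exp(v - m) for v in values) / len(values))

def log_marginal_posterior(phi_grid, theta_draws, log_ratio, log_prior):
    """Unnormalized log p(phi | y) on a grid: log p(phi) + log(average importance ratio)."""
    return [log_prior(phi) + log_mean_exp([log_ratio(theta, phi) for theta in theta_draws])
            for phi in phi_grid]

y = [-1.2, 0.3, 0.8, -0.5, 1.1, 0.1, -0.7, 0.6, 0.2, 3.5]   # hypothetical data
n, ybar = len(y), sum(y) / len(y)
rng = random.Random(0)
theta_draws = [rng.gauss(ybar, n ** -0.5) for _ in range(2000)]   # draws from p0(theta | y)

def log_ratio(theta, nu):
    # The flat prior on theta cancels, leaving the ratio of the two data densities.
    return sum(log_t_pdf(yi, nu, theta, 1.0) - log_normal_pdf(yi, theta, 1.0) for yi in y)

nu_grid = [1.0, 2.0, 3.0, 5.0, 10.0, 30.0, 1e6]
log_post = log_marginal_posterior(nu_grid, theta_draws, log_ratio, lambda nu: 0.0)
```

The same set of draws `theta_draws` is reused at every grid value, which is what makes the method cheap. At very large ν the t density is essentially the normal, so the log ratio is near zero there, a convenient sanity check.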
Approximating the robust posterior distributions by importance resampling 


To perform importance resampling, it is best to start with a large number of draws, say 
S = 5000, from the original posterior distribution, p_0(θ|y). Now, for each distribution in 
the expanded family indexed by φ, draw a smaller subsample, say k = 500, from the S 
draws, without replacement, using importance resampling, in which each of the k samples 
is drawn with probability proportional to its importance ratio, 


    \frac{p(\theta \mid \phi, y)}{p_0(\theta \mid y)} \propto \frac{p(\theta \mid \phi)\, p(y \mid \theta, \phi)}{p_0(\theta)\, p_0(y \mid \theta)}.


A new set of subsamples must be drawn for each value of φ, but the same set of S original 
draws may be used. Details are given in Section 10.4. This procedure is effective as long as 
the largest importance ratios are plentiful and not too variable; if they do vary greatly, this 
is an indication of potential sensitivity because p(θ|φ, y)/p_0(θ|y) is sensitive to the drawn 
values of θ. If the importance weights are too variable for importance resampling to be 
considered accurate, and accurate inferences under the robust alternatives are desired, then 
we must rely on Markov chain simulation. 
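
The resampling step itself is short. The sketch below implements one common reading of sampling without replacement with probability proportional to the importance ratio: draws are selected one at a time, each with probability proportional to its weight among the draws not yet chosen. The function name and the example weights are hypothetical, and Section 10.4 remains the reference for details.

```python
import random

def importance_resample(weights, k, rng):
    """Return k distinct indices, drawn sequentially with probability proportional
    to weight among the indices not yet chosen."""
    remaining = list(range(len(weights)))
    w = [float(x) for x in weights]
    chosen = []
    for _ in range(k):
        pick = rng.choices(range(len(remaining)), weights=w, k=1)[0]
        chosen.append(remaining.pop(pick))   # remove the selected draw ...
        w.pop(pick)                          # ... and its weight, so it cannot recur
    return chosen

# Hypothetical importance ratios for six draws; in practice work from log ratios and
# subtract the maximum before exponentiating.
ratios = [0.5, 2.0, 0.1, 1.0, 4.0, 0.2]
subsample = importance_resample(ratios, k=3, rng=random.Random(2))
print(subsample)
```

Sampling without replacement limits the damage when a few ratios are very large: a single dominant draw can appear at most once in the subsample instead of filling it.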


17.4 Robust inference for the eight schools 


Consider the hierarchical model for SAT coaching effects based on the data in Table 5.2 in 
Section 5.5. Given the large sample sizes in the eight original experiments, there should be 
little concern about assuming the data model that has y_j ~ N(θ_j, σ_j²), with the variances σ_j² 
known. The population model, θ_j ~ N(μ, τ²), is more difficult to justify, although the model 
checks in Section 6.5 suggest that it is adequate for the purposes of obtaining posterior 
intervals for the school effects. In general, however, posterior inferences can be highly 
sensitive to the assumed model, even when the model provides a good fit to the observed 
data. To illustrate methods for robust inference and sensitivity analysis, we explore an 
alternative family of models that fit t distributions to the population of school effects: 


    \theta_j \mid \nu, \mu, \tau \sim t_\nu(\mu, \tau^2), \quad \text{for } j = 1, \ldots, 8.        (17.4)


We use the notation p(θ, μ, τ|ν, y) ∝ p(θ, μ, τ|ν) p(y|θ, μ, τ, ν) for the posterior distribution 
under the t_ν model and p_0(θ, μ, τ|y) ≡ p(θ, μ, τ|ν = ∞, y) for the posterior distribution 
under the normal model evaluated in Section 5.5. 


Robust inference based on a t_4 population distribution 


As discussed at the beginning of this chapter, one might be concerned that the normal 
population model causes the most extreme estimated school effects to be pulled too much 
toward the grand mean. Perhaps the coaching program in school A, for example, is different 
enough from the others that its estimate should not be shrunk so much to the average. A 
related concern would be that the largest observed effect, in school A, may be exerting 
undue influence on estimation of the population variance, τ², and thereby also on the 
Bayesian estimates of the other effects. From a modeling standpoint, there is a great 
variety of different SAT coaching programs, and the population of their effects might be 
better fitted by a long-tailed distribution. To assess the importance of these concerns, we 
perform a robust analysis, replacing the normal population distribution by the t model 
(17.4) with ν = 4 and leaving the rest of the model unchanged; that is, the likelihood is 
still p(y|θ, ν) = ∏_j N(y_j | θ_j, σ_j²), and the hyperprior distribution is still p(μ, τ|ν) ∝ 1. 

Gibbs sampling. We carry out Gibbs sampling using the approach described in Section 12.1 
with ν = 4. (See Appendix C for details on fitting such a model in Stan or performing the 
Gibbs sampler in R.) The resulting inferences for the eight schools, based on 2500 draws 
from the posterior distribution (the last halves of five chains, each of length 1000), are 
provided in Table 17.1. The results are essentially identical, for practical purposes, to the 
inferences under the normal model displayed in Table 5.3 on page 123, with just slightly 
less shrinkage for the more extreme schools such as school A. 


                    Posterior quantiles 
School    2.5%    25%    median    75%    97.5% 
A          −2      6       11      16      34 
B          −5      4        8      12      21 
C         −14      2        7      11      21 
D          −6      4        8      12      21 
E          −9      1        6       9      17 
F          −9      3        7      10      19 
G          −1      6       10      15      26 
H          −8      4        8      13      26 


Table 17.1 Summary of 2500 simulations of the treatment effects in the eight schools, using the 
t_4 population distribution in place of the normal. Results are similar to those obtained under the 
normal model and displayed in Table 5.3. 


Computation using importance resampling. Though we have already done the Markov 
chain simulation, we discuss briefly how to apply importance resampling to approximate the 
posterior distribution with ν = 4. First, we sample 5000 draws of (θ, μ, τ) from p_0(θ, μ, τ|y), 
the posterior distribution under the normal model, as described in Section 5.4. Next, we 
compute the importance ratio for each draw: 


    \frac{p(\theta, \mu, \tau \mid \nu, y)}{p_0(\theta, \mu, \tau \mid y)}
        \propto \frac{p(\mu, \tau \mid \nu)\, p(\theta \mid \mu, \tau, \nu)\, p(y \mid \theta, \mu, \tau, \nu)}{p_0(\mu, \tau)\, p_0(\theta \mid \mu, \tau)\, p_0(y \mid \theta, \mu, \tau)}
        = \prod_{j=1}^{8} \frac{t_\nu(\theta_j \mid \mu, \tau^2)}{\mathrm{N}(\theta_j \mid \mu, \tau^2)}.        (17.5)

The factors for the likelihood and hyperprior density cancel in the importance ratio, leaving 
only the ratio of the population densities. We sample 500 draws of (θ, μ, τ), without 
replacement, from the sample of 5000, using importance resampling. In this case the 
approximation is probably sufficient for assessing robustness, but the long tail of the distribution 
of the logarithms of the importance ratios (not shown) does indicate serious problems for 
obtaining accurate inferences using importance resampling. 


Sensitivity analysis based on t_ν distributions with varying values of ν 


A slightly different concern from robustness is the sensitivity of the posterior inference to 
the prior assumption of a normal population distribution. To study the sensitivity, we now 
fit a range of t distributions, with 1, 2, 3, 5, 10, and 30 degrees of freedom. We have already 
fitted infinite degrees of freedom (the normal model) and 4 degrees of freedom (the robust 
model above). 

For each value of ν, we perform a Markov chain simulation to obtain draws from 
p(θ, μ, τ|ν, y). Instead of displaying a table of posterior summaries such as Table 17.1 for 
each value of ν, we summarize the results by the posterior mean and standard deviation of 
each of the eight school effects θ_j. Figure 17.1 displays the results as a function of 1/ν. The 
parameterization in terms of 1/ν rather than ν has the advantage of including the normal 
distribution at 1/ν = 0 and encompassing the entire range from normal to Cauchy 
distributions in the finite interval [0,1]. There is some variation in the figures but no apparent 
systematic sensitivity of inferences to the hyperparameter, ν. 


Figure 17.1 Posterior means and standard deviations of treatment effects as functions of ν, on the 
scale of 1/ν, for the sensitivity analysis of the educational testing example. The values at 1/ν = 0 
come from the simulations under the normal distribution in Section 5.5. Much of the scatter in the 
graphs is due to simulation variability. 


Treating ν as an unknown parameter 


Finally, we consider the sensitivity analysis parameter, ν, as an unknown quantity and 
average over it in the posterior distribution. In general, this computation is a key step, 
because we are typically only concerned with sensitivity to models that are supported by 
the data. In this particular example, inferences are so insensitive to ν that computing the 
marginal posterior distribution is unnecessary; we include it here as an illustration of the 
general method. 


Prior distribution. Before computing the posterior distribution for ν, we must assign it 
a prior distribution. We try a uniform density on 1/ν for the range [0,1] (that is, from the 
normal to the Cauchy distributions). This prior distribution favors long-tailed models, with 
half of the prior probability falling between the t_1 (Cauchy) and t_2 distributions. 
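
The claim about half the prior mass follows from a one-line change of variables. The display below is our own worked step, under the stated uniform prior on 1/ν over [0,1]:

```latex
\Pr(1 \le \nu \le 2) = \Pr\!\left(\tfrac{1}{2} \le \tfrac{1}{\nu} \le 1\right) = \tfrac{1}{2},
\qquad
p(\nu) = p\!\left(\tfrac{1}{\nu}\right)\left|\frac{d(1/\nu)}{d\nu}\right| = \nu^{-2}
\quad \text{for } \nu \ge 1 .
```

The implied density on ν falls off as ν⁻², so, for example, only a tenth of the prior mass lies above ν = 10. This prior also places no mass on ν < 1.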

In addition, the conditional prior distributions, p(μ, τ|ν) ∝ 1, are improper, so we must 
specify their dependence on ν; we use the notation p(μ, τ|ν) ∝ g(ν). In the t family, the 
parameters μ and τ characterize the median and the second derivative of the density function 
at the median, not the mean and variance, of the distribution of the θ_j’s. The parameter 
μ seems to have a reasonable invariant meaning (and in fact is equal to the mean except 
in the limiting case of the Cauchy where the mean does not exist), but the interquartile 
range would perhaps be a more reasonable parameter than the curvature for setting up a 
prior distribution. We cannot parameterize the t_ν distributions in terms of their variance, 
because the variance is infinite for ν ≤ 2. The interquartile range varies mildly as a function 
of ν, and so for simplicity we use the convenient parameterization in terms of (μ, τ) and set 
g(ν) ∝ 1. Combining this with our prior distribution on ν yields an improper joint uniform 
prior density on (μ, τ, 1/ν). If our posterior inferences under this model turn out to depend 
strongly on ν, we should consider refining this prior distribution. 


Posterior inference. To treat ν as an unknown parameter, we modify the Gibbs sampling 
simulation used in the robust analyses to include a Metropolis step for sampling from the 
conditional distribution of 1/ν. An example of the implementation of such an approach can 
be found in Appendix C. Figure 17.2 displays a histogram of the simulations of 1/ν. An 
alternative to extending the model is to approximate the marginal posterior density using 
importance sampling and (17.3). 
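
A minimal sketch of such a Metropolis step follows; it is not the Appendix C implementation. It assumes the uniform prior on 1/ν with g(ν) ∝ 1, so that, given (θ, μ, τ), the conditional density of 1/ν is proportional to the product of the t_ν population densities. The random-walk step size, the function names, and the fixed values of μ and τ are hypothetical, and the endpoint 1/ν = 0 is excluded for numerical convenience. The vector θ below uses the posterior medians of Table 17.1 as a stand-in for a single posterior draw.

```python
import random
from math import exp, lgamma, log, log1p, pi

def log_t_pdf(x, nu, mu, scale):
    z = (x - mu) / scale
    return (lgamma((nu + 1) / 2.0) - lgamma(nu / 2.0) - 0.5 * log(nu * pi) - log(scale)
            - (nu + 1) / 2.0 * log1p(z * z / nu))

def metropolis_step_inv_nu(inv_nu, theta, mu, tau, rng, step=0.2):
    """One random-walk Metropolis update of 1/nu given (theta, mu, tau)."""
    proposal = inv_nu + rng.gauss(0.0, step)
    if not 0.0 < proposal <= 1.0:
        return inv_nu                        # zero prior density outside (0, 1]: reject

    def log_target(r):
        return sum(log_t_pdf(t, 1.0 / r, mu, tau) for t in theta)

    log_r = log_target(proposal) - log_target(inv_nu)
    accept = rng.random() < exp(min(0.0, log_r))
    return proposal if accept else inv_nu

theta = [11.0, 8.0, 7.0, 8.0, 6.0, 7.0, 10.0, 8.0]   # stand-in for one draw of theta
rng = random.Random(1)
inv_nu, chain = 0.5, []
for _ in range(500):
    # In the full sampler, theta, mu, and tau would also be updated between these steps.
    inv_nu = metropolis_step_inv_nu(inv_nu, theta, mu=8.0, tau=3.0, rng=rng)
    chain.append(inv_nu)
```

Working on the scale of 1/ν keeps the proposal on a bounded interval, where a symmetric random walk with a fixed step size is reasonable.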

The sensitivity analysis showed that ν has only minor effects on the posterior inference; 
the results in Section 5.5 are thus not strongly dependent on the normal assumption for the 
population distribution of the parameters θ_j. If Figure 17.1 had shown a strong dependence 
on ν, as Figures 5.5-5.7 showed dependence on τ, then it might make sense to include ν 
as a hyperparameter, after thinking more seriously about a joint prior distribution for the 
parameters with noninformative prior distributions, (μ, τ, ν). 


Figure 17.2 Posterior simulations of 1/ν from the Gibbs-Metropolis computation of the robust model 
for the educational testing example, with ν treated as unknown. 


Discussion 


Robustness and sensitivity to modeling assumptions depend on the estimands being studied. 
In the SAT coaching example, posterior medians, 50%, and 95% intervals for the eight 
school effects are insensitive to the assumption of a normal population distribution (at least 
as compared to the t family). In contrast, it may be that 99.9% intervals are strongly 
dependent on the tails of the distributions and sensitive to the degrees of freedom in the 
t distribution—fortunately, these extreme tails are unlikely to be of substantive interest in 
this example. 


17.5 Robust regression using t-distributed errors 


As with other models based on the normal distribution, inferences under the normal linear 
regression model of Chapter 14 are sensitive to unusual or outlying values. Robust regression 
analyses are obtained by considering robust alternatives to the normal distribution for 
regression errors. Robust error distributions, such as the t with few degrees of freedom, 
treat observations far from the regression line as high-variance observations, yielding results 
similar to those obtained by downweighting outliers. (Recall that the ‘weights’ in weighted 
linear regression are inverse variances.) 


Iterative weighted linear regression and the EM algorithm 


To illustrate robust regression calculations, we consider the t_ν regression model with fixed 
degrees of freedom as an alternative to the normal linear regression model. The conditional 
distribution of the individual response variable y_i given the vector of explanatory variables 
X_i is p(y_i | X_iβ, σ²) = t_ν(y_i | X_iβ, σ²). The t_ν distribution can be expressed as a mixture as in 
equation (17.1) with X_iβ as the mean. As a first step in the robust analysis, we find the mode 
of the posterior distribution p(β, σ²|ν, y) given the vector y consisting of n observations. 
Here we assume that a noninformative prior distribution is used, p(β, log σ|ν) ∝ 1; more 
substantial information about the regression parameters can be incorporated exactly as in 
Section 14.8 and Chapter 15. The posterior mode of p(β, log σ|ν, y) under the t model 
can be obtained directly using Newton’s method (Section 13.1) or any other mode-finding 
technique. Alternatively, we can take advantage of the mixture form of the t model and 
use the EM algorithm with the variances V_i treated as ‘missing data’ (that is, parameters 
to be averaged over); in the notation of Section 13.4, γ = (V_1,…,V_n). The E-step of the 
EM algorithm computes the expected value of the sufficient statistics for the normal model 
(Σ_i y_i²/V_i, Σ_i y_i/V_i, Σ_i 1/V_i), given the current parameter estimates (β^old, σ^old) and 
averaging over (V_1,…,V_n). It is sufficient to note that 


    V_i \mid y_i, \beta^{\rm old}, \sigma^{\rm old}, \nu \sim \text{Inv-}\chi^2\!\left(\nu + 1,\ \frac{\nu(\sigma^{\rm old})^2 + (y_i - X_i\beta^{\rm old})^2}{\nu + 1}\right)        (17.6)

and that 

    \mathrm{E}\!\left(\frac{1}{V_i} \,\Big|\, y_i, \beta^{\rm old}, \sigma^{\rm old}, \nu\right) = \frac{\nu + 1}{\nu(\sigma^{\rm old})^2 + (y_i - X_i\beta^{\rm old})^2}.


The M-step of the EM algorithm is a weighted linear regression with diagonal weight matrix 
W containing the conditional expectations of 1/V_i on the diagonal. The updated parameter 
estimates are 


    \hat{\beta}^{\rm new} = (X^T W X)^{-1} X^T W y
    \quad\text{and}\quad
    (\hat{\sigma}^{\rm new})^2 = \frac{1}{n}\,(y - X\hat{\beta}^{\rm new})^T W\, (y - X\hat{\beta}^{\rm new}),


where X is the n x p matrix of explanatory variables. The iterations of the EM algorithm 
are equivalent to those performed in an iterative weighted least squares algorithm. Given 
initial estimates of the regression parameters, weights are computed for each case, with 
those cases having large residuals given less weight. Improved estimates of the regression 
parameters are then obtained by weighted linear regression. 
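
The iteration can be sketched for a regression with one predictor and an intercept, using only the Python standard library. Two choices here are assumptions of ours. First, the weights are the conditional expectations of 1/V_i multiplied by the current σ², which makes them dimensionless, leaves the weighted regression estimate unchanged, and makes the scale update a variance. Second, the iteration starts from equal weights, so that the first pass is ordinary least squares. The data and names are hypothetical.

```python
def em_t_regression(x, y, nu, n_iter=200):
    """EM (iteratively reweighted least squares) for y_i ~ t_nu(a + b*x_i, sigma^2),
    with nu fixed. Returns (a, b, sigma)."""
    n = len(y)
    w = [1.0] * n                # equal weights: the first M-step is ordinary least squares
    for _ in range(n_iter):
        # M-step: weighted least squares for (a, b) from the 2x2 normal equations
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swy = sum(wi * yi for wi, yi in zip(w, y))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
        det = sw * swxx - swx * swx
        a = (swxx * swy - swx * swxy) / det
        b = (sw * swxy - swx * swy) / det
        resid = [yi - a - b * xi for xi, yi in zip(x, y)]
        sigma2 = sum(wi * r * r for wi, r in zip(w, resid)) / n
        # E-step: w_i = sigma^2 * E(1/V_i | y, a, b, sigma) = (nu + 1) / (nu + r_i^2 / sigma^2)
        w = [(nu + 1.0) / (nu + r * r / sigma2) for r in resid]
    return a, b, sigma2 ** 0.5

# Hypothetical data: a line with intercept 1 and slope 2, small noise, and one gross outlier.
x = [float(i) for i in range(10)]
noise = [0.1, -0.2, 0.15, -0.05, 0.0, 0.1, -0.1, 0.05, -0.15, 0.1]
y = [1.0 + 2.0 * xi + e for xi, e in zip(x, noise)]
y[9] += 30.0

a_ols, b_ols, _ = em_t_regression(x, y, nu=4.0, n_iter=1)   # a single pass is least squares
a_t, b_t, sigma_t = em_t_regression(x, y, nu=4.0)
print(round(b_ols, 2), round(b_t, 2))
```

In this example the least-squares slope is pulled well above 2 by the outlying point, while the t_4 fit returns a slope close to 2, because the weight given to the outlier shrinks as its scaled residual grows.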

When the degrees of freedom parameter, ν, is treated as unknown, the ECME algorithm 
can be applied, with an additional step added to the iteration for updating the degrees of 
freedom. 


Other robust models. Iterative weighted linear regression, or equivalently the EM algo- 
rithm, can be used to obtain the posterior mode for a number of robust alternative models. 
Changing the probability model used for the observation variances, V_i, creates alternative 
robust models. For example, a two-point distribution can be used to model a regression 
with contaminated errors. The computations for robust models of this form are as described 
above, except that the E-step is modified to reflect the appropriate posterior conditional 
mean. 


Gibbs sampler and Metropolis algorithm 


Posterior draws from robust regression models can be obtained using Gibbs sampling and 
the Metropolis algorithm, as with the linear and generalized linear models discussed in 
Chapters 14-16. Using the mixture parameterization of the t_ν distribution, we can obtain 
draws from the posterior distribution p(β, σ², V_1,…,V_n | ν, y) by alternately sampling 
from p(β, σ² | V_1,…,V_n, ν, y) using the usual posterior distribution from weighted linear 
regression, and sampling from p(V_1,…,V_n | β, σ², ν, y), a set of independent scaled inverse-χ² 
distributions as in equation (17.6). It can be even more effective to use parameter expansion 
as explained in Section 12.1. 

If the degrees of freedom parameter, ν, is included as an unknown parameter in the 
model, then an additional Metropolis step is required in each iteration. In practice, these 
computations can be difficult to implement, because with low degrees of freedom ν, the 
posterior distribution can have many modes, and the Gibbs sampler and Metropolis algo- 
rithms can get stuck. It is important to run many simulations with overdispersed starting 
points for complicated models of this form. 


17.6 Bibliographic note 


Mosteller and Wallace (1964) use the negative binomial distribution, instead of the Poisson, 
for count data, and extensively study the sensitivity of their conclusions to model 
assumptions. Box and Tiao (1968) provide another early discussion of Bayesian robustness, 
in the context of outliers in normal models. Smith (1983) extends Box’s approach and 
also discusses the t family using the same parameterization (inverse degrees of freedom) 
as we have. A review of models for overdispersion in binomial data, from a non-Bayesian 
point of view, is given by Anderson (1988), who cites many further references. Gaver and 
O’Muircheartaigh (1987) discuss the use of hierarchical Poisson models for robust Bayesian 
inference. O’Hagan (1979) and Gelman (1992a) discuss the connection between the tails 
of the population distribution of a hierarchical model and the shrinkage in the associated 
Bayesian posterior distribution. 

In a series of papers, Berger and coworkers have explored theoretical aspects of Bayesian 
robustness, examining, for example, families of prior distributions that provide maximum 
robustness against the influence of aberrant observations; see for instance Berger (1984, 
1990) and Berger and Berliner (1986). Related work appears in Wasserman (1992). An 
earlier overview from a pragmatic point of view close to ours was provided by Dempster 
(1975). Rubin (1983a) provides an illustration of the limitations of data in assessing model 
fit and the resulting inevitable sensitivity of some conclusions to untestable assumptions. 

With the recent advances in computation, modeling with the t distribution has become 
increasingly common in statistics. Dempster, Laird, and Rubin (1977) show how to apply 
the EM algorithm to t models, and Liu and Rubin (1995) and Meng and van Dyk (1997) 
discuss faster computational methods using extensions of EM. Lange, Little, and Taylor 
(1989) discuss the use of the t distribution in a variety of statistical contexts. Raghunathan 
and Rubin (1990) present an example using importance resampling. Tipping and Lawrence 
(2005) apply factorized variational approximation, Vanhatalo, Jylänki, and Vehtari (2009) apply 
Laplace’s method, and Jylänki, Vanhatalo, and Vehtari (2011) apply expectation propagation 
to t models. Liu (2004) presents the ‘robit’ model as an alternative to logistic and probit 
regression. 

Rubin (1983b) and Lange and Sinsheimer (1993) review the connections between robust 
regression, the t and related distributions, and iterative regression computations. 

Taplin and Raftery (1994) present an example of an application of a finite mixture model 
for robust Bayesian analysis of agricultural experiments. 


17.7 Exercises 


1. Prior distributions and shrinkage: in the educational testing experiments, suppose we 
think that most coaching programs are almost useless, but some are strongly effective; 
a corresponding population distribution for the school effects is a mixture, with most of 
the mass near zero but some mass extending far in the positive direction; for example, 


    p(\theta_1, \ldots, \theta_8) = \prod_{j=1}^{8} \left[ \lambda_1 \mathrm{N}(\theta_j \mid \mu_1, \tau_1^2) + \lambda_2 \mathrm{N}(\theta_j \mid \mu_2, \tau_2^2) \right].


All these parameters could be estimated from the data (as long as we restrict the parameter 
space, for example by setting μ_1 > μ_2), but to fix ideas, suppose that μ_1 = 0, 
τ_1 = 10, μ_2 = 15, τ_2 = 25, λ_1 = 0.9, and λ_2 = 0.1. 
(a) Compute the posterior distribution of (θ_1,…,θ_8) under this model for the data in 
Table 5.2. 
(b) Graph the posterior distribution for θ_8 under this model for y_8 = 0, 25, 50, and 100, 
with the same standard deviation σ_8 as given in Table 5.2. Describe qualitatively the 
effect of the two-component mixture prior distribution. 


2. Poisson and negative binomial distributions: as part of their analysis of the Federalist 
papers, Mosteller and Wallace (1964) recorded the frequency of use of various words in 
Number of occurrences in a block     0    1    2    3    4    5    6   >6 
Number of blocks (Hamilton)        128   67   32   14    4    1    1    0 
Number of blocks (Madison)         156   63   29    8    4    1    1    0 


Table 17.2 Observed distribution of the word ‘may’ in papers of Hamilton and Madison, from 
Mosteller and Wallace (1964). Out of the 247 blocks of Hamilton’s text studied, 128 had no instances 
of ‘may,’ 67 had one instance of ‘may,’ and so forth, and similarly for Madison. 


selected articles by Alexander Hamilton and James Madison. The articles were divided 
into 247 blocks of about 200 words each, and the number of instances of various words 
in each block were recorded. Table 17.2 displays the results for the word ‘may.’ 


(a) Fit the Poisson model to these data, with different parameters for each author and 
a noninformative prior distribution. Plot the posterior density of the Poisson mean 
parameter for each author. 

(b) Fit the negative binomial model to these data with different parameters for each 
author. What is a reasonable noninformative prior distribution to use? For each 
author, make a contour plot of the posterior density of the two parameters and a 
scatterplot of the posterior simulations. 

3. Model checking with the Poisson and binomial distributions: we examine the fit of the 
models in the previous exercise using posterior predictive checks. 

(a) Considering the nature of the data and of likely departures from the model, what 
would be appropriate test statistics? 

(b) Compare the observed test statistics to their posterior predictive distribution (see 
Section 6.3) to test the fit of the Poisson model. 

(c) Perform the same test for the negative binomial model. 

4. Robust models and model checking: fit a robust model to Newcomb’s speed of light data 
(Figure 3.1). Check the fit of the model using appropriate techniques from Chapter 6. 

5. Contamination models: construct and fit a normal mixture model to the dataset used in 
the previous exercise. 

6. Robust models: 

(a) Choose a dataset from one of the examples or exercises earlier in the book and analyze 
it using a robust model. 

(b) Check the fit of the model using the posterior predictive distribution and appropriate 
test variables. 

(c) Discuss how inferences changed under the robust model. 

7. Computation for the t model: consider the model $y_1,\ldots,y_n \sim \mbox{iid } t_\nu(\mu,\sigma^2)$, with $\nu$ fixed 
and a uniform prior distribution on $(\mu, \log\sigma)$. 

(a) Work out the steps of the EM algorithm for finding posterior modes of $(\mu, \log\sigma)$, 
using the specification (17.1) and averaging over $V_1,\ldots,V_n$. Clearly specify the joint 
posterior density, its logarithm, the function $\mathrm{E}_{\mathrm{old}} \log p(\mu, \log\sigma, V_1,\ldots,V_n|y)$, and the 
updating equations for the M-step. 

(b) Work out the Gibbs sampler for drawing posterior simulations of $(\mu, \log\sigma, V_1,\ldots,V_n)$. 

(c) Illustrate the analysis with the speed of light data of Figure 3.1, using a $t_2$ model. 

8. Robustness and sensitivity analysis: repeat the computations of Section 17.4 with the 
dataset altered as described on page 435 so that the observation $y_8$ is replaced by 100. 
Verify that, in this case, inferences are sensitive to $\nu$. Which values of $\nu$ have highest 
marginal posterior density? 
Chapter 18 


Models for missing data 


Our discussions of probability models in previous chapters, with few exceptions, assume 
that the desired dataset is completely observed. In this chapter we consider probability 
models and Bayesian methods for problems with missing data. This chapter applies some 
of the terminology and notation of Chapter 8, which describes models for collection and 
observation of data. 

We show how the analysis of problems involving missing data can often be separated 
into two main tasks: (1) multiple imputation, that is, simulating draws from the posterior 
predictive distribution of unobserved $y_{\mathrm{mis}}$ conditional on observed values $y_{\mathrm{obs}}$, and (2) 
drawing from the posterior distribution of model parameters $\theta$. The general idea is to 
extend the model specification to incorporate the missing observations and then to perform 
inference by averaging over the distribution of the missing values. 

Bayesian inference draws no distinction between missing data and parameters; both are 
uncertain, and they have a joint posterior distribution, conditional on observed data. The 
practical distinction arises when setting up the joint model for observed data, unobserved 
data, and parameters. As discussed in Chapter 8, we prefer to set this up in three parts: 
a prior distribution for the parameters (along with a hyperprior distribution if the model 
is hierarchical), a joint model for all the data (missing and observed), and an inclusion 
model for the missingness process. If the missing data mechanism is ignorable (see Section 
8.2), then inference about the parameters and missing data can proceed without modeling 
the inclusion process. However, this process does need to be modeled when simulating 
replicated datasets for model checking. 


18.1 Notation 


We begin by reviewing some notation from Chapter 8, focusing on the problem of unin- 
tentional missing data. As in Chapter 8, let y represent the ‘complete data’ that would 
be observed in the absence of missing values. The notation is intended to be general; y 
may be a vector of univariate measures or a matrix with each row containing the multi- 
variate response variables of a single unit. Furthermore, it may be convenient to think of 
the complete data y as incorporating covariates, for example using a multivariate normal 
model for the vector of predictors and outcomes jointly in a regression context. We write 
$y = (y_{\mathrm{obs}}, y_{\mathrm{mis}})$, where $y_{\mathrm{obs}}$ denotes the observed values and $y_{\mathrm{mis}}$ denotes the missing values. 
We also include in the model a random variable indicating whether each component of $y$ is 
observed or missing. The inclusion indicator $I$ is a data structure of the same size as $y$ with 
each element of $I$ equal to 1 if the corresponding component of $y$ is observed and 0 if it is 
missing; we assume that $I$ is completely observed. In a sample survey, item nonresponse 
corresponds to $I_{ij} = 0$ for unit $i$ and item $j$, and unit nonresponse corresponds to $I_{ij} = 0$ 
for unit $i$ and all items $j$. 
The joint distribution of $(y, I)$, given parameters $(\theta, \phi)$, can be written as 


$$p(y, I|\theta, \phi) = p(y|\theta)\,p(I|y, \phi).$$


The conditional distribution of $I$ given the complete dataset $y$, indexed by the unknown 
parameter $\phi$, describes the missing-data mechanism. The observed information is $(y_{\mathrm{obs}}, I)$; 
the distribution of the observed data is obtained by integrating over the distribution of $y_{\mathrm{mis}}$: 


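$$p(y_{\mathrm{obs}}, I|\theta, \phi) = \int p(y_{\mathrm{obs}}, y_{\mathrm{mis}}|\theta)\,p(I|y_{\mathrm{obs}}, y_{\mathrm{mis}}, \phi)\,dy_{\mathrm{mis}}. \qquad (18.1)$$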


Missing data are said to be missing at random (MAR) if the distribution of the missing-data 
mechanism does not depend on the missing values, 


$$p(I|y_{\mathrm{obs}}, y_{\mathrm{mis}}, \phi) = p(I|y_{\mathrm{obs}}, \phi),$$


so that the distribution of the missing-data mechanism is permitted to depend on other 
observed values (including fully observed covariates) and parameters $\phi$. Formally, missing 
at random only requires the evaluation of $p(I|y,\phi)$ at the observed values of $y_{\mathrm{obs}}$, not all 
possible values of $y_{\mathrm{obs}}$. Under MAR, the joint distribution (18.1) of $y_{\mathrm{obs}}, I$ can be written 
as 


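$$\begin{aligned} p(y_{\mathrm{obs}}, I|\theta, \phi) &= p(I|y_{\mathrm{obs}}, \phi)\int p(y_{\mathrm{obs}}, y_{\mathrm{mis}}|\theta)\,dy_{\mathrm{mis}} \\ &= p(I|y_{\mathrm{obs}}, \phi)\,p(y_{\mathrm{obs}}|\theta). \end{aligned} \qquad (18.2)$$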


If, in addition, the parameters governing the missing data mechanism, $\phi$, and the parameters 
of the data model, $\theta$, are distinct, in the sense of being independent in the prior distribution, 
then Bayesian inferences for the model parameters $\theta$ can be obtained by considering only 
the observed-data likelihood, $p(y_{\mathrm{obs}}|\theta)$. In this case, the missing-data mechanism is said to 
be ignorable. 

In addition to the terminology of the previous paragraph, we speak of data that are 
observed at random as well as missing at random if the distribution of the missing-data 
mechanism is completely independent of y: 


$$p(I|y_{\mathrm{obs}}, y_{\mathrm{mis}}, \phi) = p(I|\phi). \qquad (18.3)$$


In such cases, we say the missing data are missing completely at random (MCAR). The 
preceding paragraph shows that the weaker pair of assumptions of MAR and distinct pa- 
rameters is sufficient for obtaining Bayesian inferences without requiring further modeling 
of the missing-data mechanism. Since it is relatively rare in practical problems for MCAR 
to be plausible, we focus in this chapter on methods suitable for the more general case of 
MAR. 

The plausibility of MAR (but not MCAR) is enhanced by including as many observed 
characteristics of each individual or object as possible when defining the dataset y. Increas- 
ing the pool of observed variables (with relevant variables) decreases the degree to which 
missingness depends on unobservables given the observed variables. 

We conclude this section with a discussion of several examples that illustrate the ter- 
minology and principles described above. Suppose that the measurements consist of two 
variables y = (age, income) with age recorded for all individuals but income missing for 
some individuals. For simplicity of discussion, we model the joint distribution of the out- 
comes as bivariate normal. If the probability that income is recorded is the same for all 
individuals, independent of age and income, then the data are missing at random and ob- 
served at random, and therefore missing completely at random. If the probability that 
income is missing depends on the age of the respondent but not on the income of the re- 
spondent given age, then the data are missing at random but not observed at random. The 
missing-data mechanism is ignorable when, in addition to MAR, the parameters governing 
the missing-data process are distinct from those of the bivariate normal distribution (as is 
typically the case with standard models). If, as seems likely, the probability that income is 
missing depends on age group and moreover on the value of income within each age group, 
then the data are neither missing nor observed at random. The missing data mechanism in 
this last case is said to be nonignorable. 
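
The three mechanisms in the age and income example can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the standardized variables, the correlation, and the logistic response probabilities are all hypothetical choices, not taken from any dataset in this book. It compares the mean of the recorded incomes with an estimate that uses the fully observed age variable through a regression fitted to the respondents.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical population in standardized units; the true mean income is 0.
age = rng.normal(size=n)
income = 0.6 * age + 0.8 * rng.normal(size=n)

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Probability that income is recorded under three assumed mechanisms.
p_recorded = {
    "MCAR": np.full(n, 0.6),                     # same for all individuals
    "MAR": expit(0.5 - 1.5 * age),               # depends on age only
    "nonignorable": expit(0.5 - 1.5 * income),   # depends on income itself
}

results = {}
for name, p in p_recorded.items():
    observed = rng.random(n) < p
    # Naive estimate: mean of the recorded incomes only.
    cc_mean = income[observed].mean()
    # Regression estimate: fit income on age among respondents, then average
    # the fitted line over the ages of all individuals (age is fully observed).
    slope, intercept = np.polyfit(age[observed], income[observed], 1)
    reg_mean = (intercept + slope * age).mean()
    results[name] = (cc_mean, reg_mean)
    print(f"{name:13s} recorded-only {cc_mean:+.3f}   using age {reg_mean:+.3f}")
```

Under these assumptions, both estimates should be close to the true value under MCAR, only the estimate that conditions on age should be close under MAR, and neither should be close when the response probability depends on income itself.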

The relevance of the missing-data mechanism depends on the goals of the data analysis. 
If we are only interested in the mean and variance of the age variable, then we can discard 
all recorded income data and construct a model in which the missing-data mechanism is ig- 
norable. On the other hand, if we are interested in the marginal distribution of income, then 
the missing-data mechanism is of paramount importance and must be carefully considered. 

If information about the missing-data mechanism is available, then it may be possible 
to perform an appropriate analysis even if the missing-data mechanism is nonignorable, as 
discussed in Section 8.2. 


18.2 Multiple imputation 


Any single imputation provides a complete dataset that can be used by a variety of re- 
searchers to address a variety of questions. Assuming the imputation model is reasonable, 
the results from an analysis of the imputed dataset are likely to provide more accurate 
estimates than would be obtained by discarding data with missing values. 

The key idea of multiple imputation is to create more than one set of replacements for 
the missing values in a dataset. This addresses one of the difficulties of single imputation 
in that the uncertainty due to nonresponse under a particular missing-data model can be 
properly reflected. The data augmentation algorithm that is used in this chapter to obtain 
posterior inference can be viewed as iterative multiple imputation. 

The paradigmatic setting for missing data imputation is regression, where we are interested 
in the model $p(y|X,\theta)$ but have missing values in the matrix $X$. The full Bayesian 
approach would yield joint inference on $X_{\mathrm{mis}}, \theta|X_{\mathrm{obs}}, y$. However, this involves the difficulty 
of constructing a joint probability model for the matrix $X$ along with the (relatively simple) 
model for $y|X$. More fully, if $X$ is modeled given parameters $\psi$, we would need to perform 
inference on $X_{\mathrm{mis}}, \psi, \theta|X_{\mathrm{obs}}, y$. From a Bayesian perspective, there is no way around this 
problem, but it requires serious modeling effort that could take resources away from the 
primary goal, which is the model $p(y|X,\theta)$. 

Bayesian computation in a missing-data problem is based on the joint posterior distribu- 
tion of parameters and missing data, given modeling assumptions and observed data. The 
result of the computation is a set of vectors of simulations of all unknowns, $(y_{\mathrm{mis}}^s, \theta^s)$, 
$s = 1,\ldots,S$. At this point, there are two possible courses of action: 


• Obtain inferences for any parameters, missing data, and predictive quantities of interest. 

• Report observed data and simulated vectors $y_{\mathrm{mis}}^s$, which are called multiple imputations. 
Other users of the data can then analyze these imputed complete datasets without 
needing to model the missing-data mechanism. 

In the context of this book, the first option seems most natural, but in practice, especially 
when most of the data values are not missing, it is often useful to divide a data analysis 
in two parts: first, cleaning the data and multiply imputing missing values, and second, 
performing inference about quantities of interest using the imputed datasets. 

In multiple imputation, inferences for $X_{\mathrm{mis}}$ and $\theta$ are separated: 

1. First model X,y together and, as described in the previous paragraph, obtain joint 
inferences for the missing data and parameters of the model. At this point, the imputer 
takes the surprising step of discarding the inferences about the parameters, keeping only 
the completed datasets $X^s = (X_{\mathrm{obs}}, X_{\mathrm{mis}}^s)$ for a few random simulation draws $s$. 


2. For each imputed $X^s$, perform the desired inference for $\theta$ based on the model $p(y|X^s, \theta)$, 
treating the imputed data as if they were known. These are the sorts of models discussed 
throughout Part IV of this book. 


3. Combine the inferences from the separate imputations. With Bayesian simulation, this 
is simple—just mix together the simulations from the separate inferences. At the end of 
Section 18.2, we briefly discuss some analytic methods for combining imputations. 
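
The three steps can be sketched in a deliberately simple setting. The example below is a minimal illustration, not the regression problem described above: it assumes a single normally distributed variable with some values missing completely at random, a uniform prior distribution on $(\mu, \log\sigma)$, and hypothetical simulated data. The function name `draw_mu_sigma2` and the values of `K` and `S` are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 40 intended measurements, of which 12 are missing.
y = rng.normal(loc=10.0, scale=2.0, size=40)
y_obs = y[:28]
n_mis = 12

def draw_mu_sigma2(data, rng):
    """One posterior draw of (mu, sigma^2) for normal data, prior uniform on (mu, log sigma)."""
    n = len(data)
    s2 = data.var(ddof=1)
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)      # scaled inverse-chi^2 draw
    mu = rng.normal(data.mean(), np.sqrt(sigma2 / n))
    return mu, sigma2

# Step 1: the imputer draws parameters and missing data jointly,
# then keeps only the completed datasets.
K = 5
completed = []
for _ in range(K):
    mu, sigma2 = draw_mu_sigma2(y_obs, rng)
    y_mis = rng.normal(mu, np.sqrt(sigma2), size=n_mis)
    completed.append(np.concatenate([y_obs, y_mis]))

# Step 2: the analyst treats each completed dataset as if fully observed.
S = 2000
draws_per_imputation = [
    np.array([draw_mu_sigma2(data, rng)[0] for _ in range(S)]) for data in completed
]

# Step 3: mix together the simulations from the separate inferences.
mu_draws = np.concatenate(draws_per_imputation)
print(mu_draws.mean(), np.quantile(mu_draws, [0.025, 0.975]))
```

In this toy case the missing values carry no information beyond the observed ones, so the mixed draws should resemble the observed-data posterior distribution of $\mu$; the point of the sketch is only the division of labor between imputer and analyst.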


In this chapter, we give a quick overview of theory and methods for multiple imputation— 
that is, step 1 above—and missing-data analysis, illustrating with two examples from survey 
sampling. 


Computation using EM and data augmentation 


The process of generating missing data imputations usually begins with crude methods of 
imputation based on approximate models such as MCAR. The initial imputations are used 
as starting points for iterative mode-finding and simulation algorithms. 

Chapters 11 and 13 describe the Gibbs sampler and the EM algorithm in some detail 
as approaches for drawing simulations and obtaining the posterior mode in complex prob- 
lems. As was mentioned there, the Gibbs sampler and EM algorithms formalize a fairly 
old approach to handling missing data: replace missing data by estimated values, estimate 
model parameters, and perhaps, repeat these two steps several times. Often, a problem with 
no missing data can be easier to analyze if the dataset is augmented by some unobserved 
values, which may be thought of as missing data. 

Here, we briefly review the EM algorithm and its extensions using the notation of this 
chapter. Similar ideas apply for the Gibbs sampler, except that the goal is simulation 
rather than point estimation of the parameters. The algorithms can be applied whether 
the missing data are ignorable or not by including the missing-data model in the likelihood, 
as discussed in Chapter 8. For ease of exposition, we assume the missing-data mechanism 
is ignorable and therefore omit the inclusion indicator $I$ in the following explanation. The 
generalization to specified nonignorable models is relatively straightforward. We assume 
that any augmented data, for example, mixture component indicators, are included as part 
of $y_{\mathrm{mis}}$. Converting to the notation of Sections 13.4 and 13.5: 


Notation in Section 13.4                 Notation for missing data 

Data, $y$                                Observed data including inclusion information, $(y_{\mathrm{obs}}, I)$ 

Marginal mode of parameters $\phi$       Posterior mode of parameters $\theta$ (if the missingness 
                                         mechanism is being estimated, $(\theta, \phi)$) 

Averaging over parameters $\gamma$       Averaging over missing data $y_{\mathrm{mis}}$ 


The EM algorithm is best known as it is applied to exponential families. In that case, 
the expected complete-data log posterior density is linear in the expected complete-data 
sufficient statistics so that only the latter need be evaluated or imputed. Examples are 
provided in the next section. 


EM for nonignorable models. For a nonignorable missingness mechanism, the EM algorithm 
can also be applied as long as a model for the missing data is specified (for example, 
censored or rounded data with known censoring point or rounding rule). The only change 
in the EM algorithm is that all calculations explicitly condition on the inclusion indicator $I$. 
Specifically, the expected complete-data log posterior density is a function of model parameters 
$\theta$ and missing-data-mechanism parameters $\phi$, conditional on the observed data $y_{\mathrm{obs}}$ 
and the inclusion indicator $I$, averaged over the distribution of $y_{\mathrm{mis}}$ at the current values of 
the parameters $(\theta^{\mathrm{old}}, \phi^{\mathrm{old}})$. 
Computational shortcut with monotone missing-data patterns. A dataset is said to have a 
monotone pattern of missing data if the variables can be ordered in blocks such that the 
first block of variables is more observed than the second block of variables (that is, values in 
the first block are present whenever values in the second are present but the converse does 
not necessarily hold), the second block of variables is more observed than the third block, 
and so forth. Many datasets have this pattern or nearly so. Obtaining posterior modes 
can be especially easy when the data have a monotone pattern. For instance, with normal 
data, rather than compute a separate regression estimate conditioning on $y_{\mathrm{obs}}$ in the E-step 
for each observation, the monotone pattern implies that there are only as many patterns of 
missing data as there are blocks of variables. Thus, all of the observations with the same 
pattern of missing data can be handled in a single step. For data that are close to the 
monotone pattern, the EM algorithm can be applied as a combination of two approaches: 
first, the E-step can be carried out for those values of $y_{\mathrm{mis}}$ that are outside the monotone 
pattern; then, the more efficient calculations can be carried out for the missing data that 
are consistent with the monotone pattern. An example of a monotone data pattern appears 
in Figure 18.1 on page 459. 
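
Whether a dataset has a monotone pattern can be checked directly from its inclusion indicators. The sketch below is one possible way to do this, with single variables playing the role of blocks; the function name `monotone_order` is hypothetical. It relies on the fact that, if any ordering of the variables is monotone, then ordering them from most observed to least observed is also monotone.

```python
import numpy as np

def monotone_order(observed):
    """Return a column order that makes the pattern monotone, or None if there is none.

    observed: boolean array (units x variables), True where a value is present.
    """
    observed = np.asarray(observed, dtype=bool)
    # Candidate order: most-observed variable first (stable sort keeps ties in place).
    order = np.argsort(-observed.sum(axis=0), kind="stable")
    sorted_obs = observed[:, order]
    # Monotone: whenever a later variable is present, every earlier one is too,
    # so each row must be non-increasing from left to right.
    is_monotone = np.all(sorted_obs[:, :-1] >= sorted_obs[:, 1:])
    return order if is_monotone else None

# A monotone pattern with its columns shuffled, and a pattern that is not monotone.
pattern = np.array([[1, 1, 1],
                    [1, 1, 0],
                    [1, 0, 0],
                    [1, 1, 0]], dtype=bool)
print(monotone_order(pattern[:, [2, 0, 1]]))
print(monotone_order(np.array([[1, 0], [0, 1]], dtype=bool)))
```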


Inference with multiple imputations 


In some application areas such as sample surveys, a relatively small number of multiple 
imputations can typically be used to investigate the variability of the missing-data model, 
and some simple approximate inferences are widely applicable. To be specific, if there are 
$K$ sets of imputed values under a single model, let $\hat{\theta}_k$ and $W_k$, $k = 1,\ldots,K$, be the $K$ 
complete-data parameter estimates and associated variance estimates for the scalar parameter $\theta$. The 
$K$ complete-data analyses can be combined to form the combined estimate of $\theta$, 


$$\bar{\theta}_K = \frac{1}{K}\sum_{k=1}^{K}\hat{\theta}_k.$$


The variability associated with this estimate has two components: the average of the com- 
plete-data variances (the within-imputation component), 


$$W_K = \frac{1}{K}\sum_{k=1}^{K} W_k,$$
and the between-imputation variance, 


$$B_K = \frac{1}{K-1}\sum_{k=1}^{K}(\hat{\theta}_k - \bar{\theta}_K)^2.$$


The total variance associated with 0 is 


$$T_K = W_K + \frac{K+1}{K}B_K.$$


The standard approximate formula for creating interval estimates for 0 uses a t distribution 
with degrees of freedom, 


$$\mbox{d.f.} = (K-1)\left(1 + \frac{K}{K+1}\frac{W_K}{B_K}\right)^2,$$


an approximation computed by matching moments of the variance estimate. 
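
These combining rules are short enough to write out directly. The sketch below assumes a scalar parameter and takes the complete-data estimates and variances as plain sequences; the function name `combine_imputations` is hypothetical, and the degrees of freedom are computed in the usual moment-matching form $(K-1)\left(1 + \frac{K}{K+1}\frac{W_K}{B_K}\right)^2$.

```python
import numpy as np

def combine_imputations(estimates, variances):
    """Combine K complete-data estimates and variances for a scalar parameter."""
    theta_hat = np.asarray(estimates, dtype=float)
    W_k = np.asarray(variances, dtype=float)
    K = len(theta_hat)
    theta_bar = theta_hat.mean()          # combined estimate
    W = W_k.mean()                        # within-imputation variance
    B = theta_hat.var(ddof=1)             # between-imputation variance
    T = W + (K + 1) / K * B               # total variance
    df = (K - 1) * (1 + K / (K + 1) * W / B) ** 2
    return theta_bar, T, df

# Three hypothetical complete-data analyses of the same parameter.
theta_bar, T, df = combine_imputations([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(theta_bar, T, df)
```

An approximate interval estimate is then $\bar{\theta}_K \pm t\,\sqrt{T_K}$, with $t$ the appropriate quantile of the t distribution with the computed degrees of freedom. The function assumes the complete-data estimates are not all identical, since $B_K = 0$ makes the degrees of freedom undefined.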
If the fraction of missing information is low, posterior inference will likely not be sensitive 
to modeling assumptions about the missing-data mechanism. One approach is to create a 
‘reasonable’ missing-data model, and then check the sensitivity of the posterior inferences to 
other missing-data models. In particular, it often seems helpful to begin with an ignorable 
model and explore the sensitivity of posterior inferences to plausible nonignorable models. 


18.3 Missing data in the multivariate normal and t models 


We consider the basic continuous-data model in which $y$ represents a sample of size $n$ from 
a $d$-dimensional multivariate normal distribution $\mathrm{N}_d(\mu,\Sigma)$ with $y_{\mathrm{obs}}$ the set of observed 
values and $y_{\mathrm{mis}}$ the set of missing values. We present the methods here for the uniform 
prior distribution, and then in the next section we give an example with a hierarchical 
multivariate model. 


Finding posterior modes using EM 


The multivariate normal is an exponential family with sufficient statistics 


$$\sum_{i=1}^{n} y_{ij}, \quad j = 1,\ldots,d,$$

and

$$\sum_{i=1}^{n} y_{ij}y_{ik}, \quad j,k = 1,\ldots,d.$$

Let $y_{\mathrm{obs}\,i}$ denote the components of $y_i = (y_{i1},\ldots,y_{id})$ that are observed and $y_{\mathrm{mis}\,i}$ denote 
the missing components. Let $\theta^{\mathrm{old}} = (\mu^{\mathrm{old}}, \Sigma^{\mathrm{old}})$ denote the current estimates of the model 
parameters. The E step of the EM algorithm computes the expected value of these suf- 
ficient statistics conditional on the observed values and the current parameter estimates. 
Specifically, 


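$$\mathrm{E}\left(\sum_{i=1}^{n} y_{ij}\,\Big|\,y_{\mathrm{obs}}, \theta^{\mathrm{old}}\right) = \sum_{i=1}^{n} y_{ij}^{\mathrm{old}}$$

$$\mathrm{E}\left(\sum_{i=1}^{n} y_{ij}y_{ik}\,\Big|\,y_{\mathrm{obs}}, \theta^{\mathrm{old}}\right) = \sum_{i=1}^{n}\left(y_{ij}^{\mathrm{old}}y_{ik}^{\mathrm{old}} + c_{ijk}^{\mathrm{old}}\right),$$

where

$$y_{ij}^{\mathrm{old}} = \begin{cases} y_{ij} & \mbox{if } y_{ij} \mbox{ is observed} \\ \mathrm{E}(y_{ij}|y_{\mathrm{obs}}, \theta^{\mathrm{old}}) & \mbox{if } y_{ij} \mbox{ is missing,} \end{cases}$$

and

$$c_{ijk}^{\mathrm{old}} = \begin{cases} 0 & \mbox{if } y_{ij} \mbox{ or } y_{ik} \mbox{ are observed} \\ \mathrm{cov}(y_{ij}, y_{ik}|y_{\mathrm{obs}}, \theta^{\mathrm{old}}) & \mbox{if } y_{ij} \mbox{ and } y_{ik} \mbox{ are missing.} \end{cases}$$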


The conditional expectation and covariance are easy to compute: the conditional posterior 
distribution of the missing elements of $y_i$, $y_{\mathrm{mis}\,i}$, given $y_{\mathrm{obs}}$ and $\theta$, is multivariate normal 
with mean vector and variance matrix obtained from the full mean vector and variance 
matrix $\theta^{\mathrm{old}}$ as in Appendix A. 

The M-step of the EM algorithm uses the expected complete-data sufficient statistics to 
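compute the next iterate, $\theta^{\mathrm{new}} = (\mu^{\mathrm{new}}, \Sigma^{\mathrm{new}})$. Specifically, 

$$\mu_j^{\mathrm{new}} = \frac{1}{n}\sum_{i=1}^{n} y_{ij}^{\mathrm{old}}, \quad \mbox{for } j = 1,\ldots,d,$$

and

$$\sigma_{jk}^{\mathrm{new}} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{ij}^{\mathrm{old}}y_{ik}^{\mathrm{old}} + c_{ijk}^{\mathrm{old}}\right) - \mu_j^{\mathrm{new}}\mu_k^{\mathrm{new}}, \quad \mbox{for } j,k = 1,\ldots,d.$$
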
Starting values for the EM algorithm can be obtained using crude methods. As always 
when finding posterior modes, it is wise to use several starting values in case of multiple 
modes. It is crucial that the initial estimate of the variance matrix be positive definite; 
thus various estimates based on complete cases (that is, units with all outcomes observed), 
if available, can be useful. 
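
The E- and M-steps above translate into a short program. The following is a minimal sketch rather than a production implementation: it assumes missing values are coded as `np.nan`, runs a fixed number of iterations instead of monitoring convergence, and starts from the observed-data means with a diagonal variance matrix (one crude but positive definite choice). The function name `em_mvn` and the simulated data are hypothetical.

```python
import numpy as np

def em_mvn(y, n_iter=100):
    """EM estimates of (mu, Sigma) for multivariate normal rows; np.nan marks missing values."""
    y = np.asarray(y, dtype=float)
    n, d = y.shape
    miss = np.isnan(y)
    # Crude starting values: observed-data means and a diagonal variance matrix.
    mu = np.nanmean(y, axis=0)
    Sigma = np.diag(np.nanvar(y, axis=0))
    for _ in range(n_iter):
        # E-step: fill in y_ij^old and accumulate the sum over i of c_ijk^old.
        y_old = np.where(miss, 0.0, y)
        C = np.zeros((d, d))
        for i in np.flatnonzero(miss.any(axis=1)):
            m, o = miss[i], ~miss[i]
            if not o.any():                      # nothing observed for this unit
                y_old[i] = mu
                C += Sigma
                continue
            # Coefficients of the regression of missing on observed components.
            coef = np.linalg.solve(Sigma[np.ix_(o, o)], Sigma[np.ix_(o, m)]).T
            y_old[i, m] = mu[m] + coef @ (y[i, o] - mu[o])
            C[np.ix_(m, m)] += Sigma[np.ix_(m, m)] - coef @ Sigma[np.ix_(o, m)]
        # M-step: update from the expected complete-data sufficient statistics.
        mu = y_old.mean(axis=0)
        Sigma = (y_old.T @ y_old + C) / n - np.outer(mu, mu)
    return mu, Sigma

# Hypothetical example: the second variable is missing whenever the first is large.
rng = np.random.default_rng(2)
data = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.8], [0.8, 2.0]], size=300)
data[data[:, 0] > 1.0, 1] = np.nan
mu_hat, Sigma_hat = em_mvn(data)
print(mu_hat)
print(Sigma_hat)
```

With complete data the first iteration already returns the sample mean and the maximum likelihood variance matrix, which gives a quick check of the update equations.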


Drawing samples from the posterior distribution of the model parameters 


One can draw imputations for the missing values from the normal model using the modal 
estimates as starting points for data augmentation (the Gibbs sampler) on the joint posterior 
distribution of missing values and parameters, alternately drawing $y_{\mathrm{mis}}$, $\mu$, and $\Sigma$ from their 
conditional posterior distributions. For more complicated models, some of the steps of the 
Gibbs sampler must be replaced by Metropolis steps. 

As with the EM algorithm, considerable gains in efficiency are possible if the missing 
data have a monotone pattern. In fact, for monotone missing data, it is possible under an 
appropriate parameterization to draw directly from the incomplete-data posterior distribution, 
$p(\theta|y_{\mathrm{obs}})$. Suppose that $y_1$ is more observed than $y_2$, $y_2$ is more observed than $y_3$, 
and so forth. To be specific, let $\psi = \psi(\theta) = (\psi_1,\ldots,\psi_k)$, where $\psi_1$ denotes the parameters 
of the marginal distribution of the first block of variables in the monotone pattern, $y_1$; $\psi_2$ 
denotes the parameters of the conditional distribution of $y_2$ given $y_1$; and so on (the normal 
distribution is $d$-dimensional, but in general the monotone pattern is defined by $k \le d$ 
blocks of variables). For multivariate normal data, $\psi_j$ contains the parameters of the linear 
regression of $y_j$ on $y_1,\ldots,y_{j-1}$: the regression coefficients and the residual variance matrix. 
The parameter $\psi$ is a one-to-one function of the parameter $\theta$, and the complete parameter 
space of $\psi$ is the product of the parameter spaces of $\psi_1,\ldots,\psi_k$. The likelihood factors into 
$k$ distinct pieces: 


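$$\begin{aligned} \log p(y_{\mathrm{obs}}|\psi) = {} & \log p(y_{\mathrm{obs}\,1}|\psi_1) + \log p(y_{\mathrm{obs}\,2}|y_{\mathrm{obs}\,1}, \psi_2) + \cdots \\ & + \log p(y_{\mathrm{obs}\,k}|y_{\mathrm{obs}\,1}, y_{\mathrm{obs}\,2},\ldots,y_{\mathrm{obs}\,k-1}, \psi_k), \end{aligned}$$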


with the $j$th piece depending only on the parameters $\psi_j$. If the prior distribution $p(\psi)$ can 
be written in closed form in the factorization, 


$$p(\psi) = p(\psi_1)\,p(\psi_2|\psi_1)\cdots p(\psi_k|\psi_1,\ldots,\psi_{k-1}),$$


then it is possible to draw directly from the posterior distribution in sequence: draw $\psi_1$ 
conditional on $y_{\mathrm{obs}}$, then $\psi_2$ conditional on $\psi_1$ and $y_{\mathrm{obs}}$, and so forth. 
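
As a small worked case of this factorization, consider an illustration with assumed notation: bivariate normal data ($d = k = 2$) in which $y_{i1}$ is observed for all $n$ units and $y_{i2}$ is observed only for the first $m$ units. The symbols $\beta_0$, $\beta_1$, and $\sigma_{2|1}^2$ for the regression of $y_2$ on $y_1$ are introduced here only for this example.

```latex
\begin{align*}
\psi_1 &= (\mu_1, \sigma_1^2), \qquad \psi_2 = (\beta_0, \beta_1, \sigma_{2|1}^2), \\
\log p(y_{\mathrm{obs}}|\psi) &= \sum_{i=1}^{n} \log \mathrm{N}(y_{i1}|\mu_1, \sigma_1^2)
  + \sum_{i=1}^{m} \log \mathrm{N}(y_{i2}|\beta_0 + \beta_1 y_{i1}, \sigma_{2|1}^2), \\
\mu_2 &= \beta_0 + \beta_1\mu_1, \qquad \sigma_{12} = \beta_1\sigma_1^2, \qquad
\sigma_2^2 = \sigma_{2|1}^2 + \beta_1^2\sigma_1^2 .
\end{align*}
```

The first sum involves only $\psi_1$ and uses all $n$ units; the second involves only $\psi_2$ and uses the $m$ units with both variables observed. The last line gives the one-to-one map back to the remaining elements of $\theta = (\mu, \Sigma)$.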

For a missing-data pattern that is not precisely monotone, we can define a monotone data 
augmentation algorithm that imputes only enough data to obtain a monotone pattern. The 
imputation step draws a sample from the conditional distribution of the elements in $y_{\mathrm{mis}}$ that 
are needed to create a monotone pattern. The posterior step then draws directly from the 
posterior distribution taking advantage of the monotone pattern. Typically, the monotone 
data augmentation algorithm will be more efficient than ordinary data augmentation if 
the departure from monotonicity is not substantial, because fewer imputations are being 
done and analytic calculations are being used to replace the other simulation steps. There 
may be several ways to order the variables that each lead to nearly monotone patterns. 
Determining the best such choice is complicated since ‘best’ is defined by providing the 
fastest convergence of an iterative simulation method. One simple approach is to choose 
$y_1$ to be the variable with the fewest missing values, $y_2$ to be the variable with the second 
fewest, and so on. 
Extending the normal model using the t distribution 


Chapter 17 described robust alternatives to the normal model based on the t distribution. 
Such models can be useful for accommodating data prone to outliers, or as a means of 
performing a sensitivity analysis on a normal model. Suppose now that the intended data 
consist of multivariate observations, 


$$y_i|\theta, V_i \sim \mathrm{N}_d(\mu, V_i\Sigma),$$


where $V_i$ are unobserved independent random variables that have a common Inv-$\chi^2(\nu, 1)$ 
distribution. For simplicity, we consider $\nu$ to be specified; if unknown, it is another 
parameter to be estimated. 

Data augmentation can be applied to the t model with missing values in $y_i$ by adding a 
step that imputes values for the $V_i$, which are thought of as additional missing data. The 
imputation step of data augmentation consists of two parts. First, a value is imputed for 
each $V_i$ from its posterior distribution given $y_{\mathrm{obs}}, \theta, \nu$. This posterior density is a product 
of the normal density for $y_{\mathrm{obs}\,i}$ given $V_i$ and the scaled inverse-$\chi^2$ prior density for $V_i$, 


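$$p(V_i|y_{\mathrm{obs}}, \theta, \nu) \propto \mathrm{N}(y_{\mathrm{obs}\,i}|\mu_{\mathrm{obs}\,i}, \Sigma_{\mathrm{obs}\,i}V_i)\,\mbox{Inv-}\chi^2(V_i|\nu, 1), \qquad (18.4)$$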


where $\mu_{\mathrm{obs}\,i}$, $\Sigma_{\mathrm{obs}\,i}$ refer to the elements of the mean vector and variance matrix corresponding 
to components of $y_i$ that are observed. The conditional posterior distribution (18.4) 
is easily recognized as scaled inverse-$\chi^2$, so obtaining imputed values for $V_i$ is straightforward. 
The second part of each iteration step is to impute the missing values $y_{\mathrm{mis}\,i}$ given 
$(y_{\mathrm{obs}}, \theta, V_i)$, which is identical to the imputation step for the ordinary normal model since 
given $V_i$, the value of $y_{\mathrm{mis}\,i}$ is obtained as a draw from the conditional normal distribution. 
The posterior step of the data augmentation algorithm treats the imputed values as if they 
were observed and is, therefore, a complete-data weighted multivariate normal problem. 
The complexity of this step depends on the prior distribution for 0. 
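
The first part of the imputation step can be sketched as follows. Multiplying the two densities in (18.4), with $d_i$ observed components and $\delta_i^2 = (y_{\mathrm{obs}\,i} - \mu_{\mathrm{obs}\,i})^T\Sigma_{\mathrm{obs}\,i}^{-1}(y_{\mathrm{obs}\,i} - \mu_{\mathrm{obs}\,i})$, gives a scaled inverse-$\chi^2$ distribution with $\nu + d_i$ degrees of freedom and scale $(\nu + \delta_i^2)/(\nu + d_i)$; the symbols $d_i$ and $\delta_i^2$ and the function name `draw_V` are introduced here for illustration, and missing values are assumed to be coded as `np.nan`.

```python
import numpy as np

def draw_V(y_i, mu, Sigma, nu, rng):
    """Draw V_i from its scaled inverse-chi^2 conditional posterior distribution."""
    y_i = np.asarray(y_i, dtype=float)
    o = ~np.isnan(y_i)                    # observed components of unit i
    d_obs = int(o.sum())
    if d_obs == 0:
        delta2 = 0.0                      # no data for this unit: draw from the prior
    else:
        resid = y_i[o] - mu[o]
        # Squared Mahalanobis distance of the observed components.
        delta2 = resid @ np.linalg.solve(Sigma[np.ix_(o, o)], resid)
    # Inv-chi^2(nu + d_obs, (nu + delta2) / (nu + d_obs)) draw.
    return (nu + delta2) / rng.chisquare(nu + d_obs)

# Hypothetical unit with one missing component and an outlying first component.
rng = np.random.default_rng(4)
mu = np.zeros(3)
Sigma = np.eye(3)
V_draws = np.array([draw_V([3.0, np.nan, 0.0], mu, Sigma, nu=4, rng=rng)
                    for _ in range(5000)])
print(V_draws.mean())
```

Units far from the center in the observed components receive larger draws of $V_i$, which is how the t model downweights outlying observations in the posterior step.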

The E-step of the EM algorithm for the t extensions of the normal model is obtained 
by replacing the imputation steps above with expectation steps. Thus the conditional 
expectation of $V_i$ from its scaled inverse-$\chi^2$ posterior distribution and conditional means 
and variances of $y_{\mathrm{mis}\,i}$ would be used in place of random draws. The M-step finds the 
conditional posterior mode rather than sampling from the posterior distribution. When the 
degrees of freedom parameter for the t distribution is allowed to vary, the ECM and ECME 
algorithms of Section 13.4 can be used. 


Nonignorable models 


The principles for performing Bayesian inference in nonignorable models are analogous to 
those presented in the ignorable case. At each stage, $\theta$ is supplemented by any parameters 
of the missing-data mechanism, $\phi$, and inference is conditional on observed data $y_{\mathrm{obs}}$ and 
the inclusion indicator I. 


18.4 Example: multiple imputation for a series of polls 


This section presents an extended example illustrating many of the practical issues involved 
in a Bayesian analysis of a survey data problem. The example concerns a series of polls 
with relatively large amounts of missing data. 


Background 


The U.S. presidential election of 1988 was unusual in that the Democratic candidate, Michael 
Dukakis, was far ahead in the polls four months before the election, but he eventually lost by 
a large margin to the Republican candidate, George Bush. We studied public opinion during 
this campaign using data from 51 national polls conducted by nine different major polling 
organizations during the six months before the election. One of our goals was to examine 
time trends in vote intentions for different subgroups of the population (for example: men 
and women; self-declared Democrats, Republicans, and independents; low-income and high- 
income; and so forth). 

Analyzing the 51 surveys required some care in handling the missing data, because not all 
questions of interest were asked in all surveys. For example, self-reported ideology (liberal, 
moderate, or conservative), a key variable, was missing in 10 of the 51 surveys, including 
our only available surveys during the Democratic nominating convention. Questions about 
the respondent’s views of the national economy and of the perceived ideologies of Bush 
and Dukakis were asked in fewer than half of the surveys, making them difficult to study. 
Imputing the responses to these questions would simplify our analysis by allowing us to 
analyze all questions in the same way. 

When imputing missing data from several sample surveys, there are two obvious ways to 
use existing single-survey methods: (1) separately imputing missing data from each survey, 
or (2) combining the data from all the surveys and imputing missing data in the combined 
‘data matrix.’ Both of these methods have problems. The first approach is difficult if there 
is a large amount of missingness in each individual survey. For example, if a particular 
question is not asked in one survey, then there is no general way to impute it without using 
information from other surveys or some additional knowledge about the relation between 
responses to that question and to other questions asked in the survey. The second method 
does not account for differences between the surveys—for example, if they are conducted at 
different times, use different sampling methodologies, or are conducted by different survey 
organizations. 

Our approach is to compromise by fitting a separate imputation model for each survey, 
but with the parameters in the different survey models linked with a hierarchical model. 
This method should have the effect that imputations of item nonresponse in a survey will be 
determined largely by the data from that survey, whereas imputations for questions that are 
not asked in a survey will be determined by data from the other surveys in the population 
as well as by available responses to other questions in that survey. 


Multivariate missing-data framework 


We impute using a hierarchical multivariate model of Q = 15 questions asked in the S = 51 
surveys, with not all questions asked in all surveys. The 15 questions included the outcome 
variable of interest (presidential vote preference), the variables that were believed to have 
the strongest relation to vote preference, and several demographic variables that were fully 
observed or nearly so. We also include in our analysis the date at which each survey was 
conducted. 

To handle both sources of missingness—‘not asked’ and ‘not answered’—we augment 
the data in such a way that the complete data consist of the same Q questions in all the S 
surveys. We denote by $y_{si} = (y_{si1}, \ldots, y_{siQ})$ the responses of individual i in survey s to all 
the Q questions. Some of the elements of $y_{si}$ may be missing. Letting $N_s$ be the number 
of respondents in survey s, the (partially unobserved) complete data have the form 


$((y_{si1}, \ldots, y_{siQ}) : i = 1, \ldots, N_s,\ s = 1, \ldots, S)$. (18.5) 


We assume ignorable missingness (see Section 8.2), plausible because almost all the missing- 
ness here is due to unasked questions. If clear violations of ignorability occur (for example, 
a question about defense policy may be more likely to be asked when the country is at war), 
then we would add survey-level variables (for example, the level of international tension at 
the time of the survey) so that it is once again reasonable to assume missingness at random. 


A hierarchical model for multiple surveys 


The simplest model for imputing the missing values in (18.5) that makes use of the data 
structure of the multiple surveys is a hierarchical model. We assume a multivariate normal 
distribution at the individual level with mean vector $\mu_s$ for survey s and a variance matrix 
$\Psi$ assumed common to all surveys, 

$y_{si} | \mu_s, \Psi \stackrel{\rm ind}{\sim} {\rm N}_Q(\mu_s, \Psi)$, $i = 1, \ldots, N_s$, $s = 1, \ldots, S$. (18.6) 

The assumption that the variance matrix $\Psi$ is the same for all surveys could be tested by, 
for example, dividing the surveys nonrandomly into two groups (for example, early surveys 
and late surveys) and estimating separate matrices $\Psi$ for the two groups. 

We might continue by modeling the survey means $\mu_s$ as exchangeable, but our background 
discussion suggests that factors at the survey level, such as organization effects and time 
trend, be included in the model. Suppose we have data on P covariates of interest at the 
survey level. Let $x_s = (x_{s1}, \ldots, x_{sP})$ be the vector of these covariates for survey s. We 
assume that $x_s$ is fully observed for each of the S surveys. We use the notation X for 
the S x P matrix of the fully observed covariates (with row s containing the covariates for 
survey s). Then our model for the survey means is a multivariate multiple regression, 

$\mu_s | X, \beta, \Sigma \stackrel{\rm ind}{\sim} {\rm N}_Q(\beta x_s, \Sigma)$, $s = 1, \ldots, S$, (18.7) 

where $\beta$ is the Q x P matrix of the regression coefficients of the mean vector on the survey 
covariate vector and $\Sigma$ is a diagonal matrix, ${\rm Diag}(\sigma_1^2, \ldots, \sigma_Q^2)$. Since $\Sigma$ is diagonal, equation 
(18.7) represents Q linear regression models with normal errors: 


$\mu_{sj} | X, \beta, \Sigma \sim {\rm N}(\beta_j x_s, \sigma_j^2)$, $s = 1, \ldots, S$, $j = 1, \ldots, Q$, (18.8) 


where $\mu_{sj}$ is the jth component of $\mu_s$ and $\beta_j$ is the jth row of $\beta$. The model in equations 
(18.6) and (18.7) allows for pooling information from all the S surveys and imputing all the 
missing values, including those to the questions that were not asked in some of the surveys. 
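
To make the two levels of the model concrete, the following Python sketch simulates survey means from (18.7) and individual responses from (18.6). It is only an illustration: the function name, the dimensions, and every numerical value are hypothetical and are not taken from the poll data.

```python
import numpy as np


def simulate_surveys(X, beta, sigma2, Psi, n_per_survey, seed=0):
    """Simulate from the hierarchical model (18.6)-(18.7).

    X: (S, P) survey-level covariates; beta: (Q, P) coefficients;
    sigma2: (Q,) diagonal of the survey-level variance matrix;
    Psi: (Q, Q) individual-level variance matrix common to all surveys.
    """
    rng = np.random.default_rng(seed)
    S, Q = X.shape[0], beta.shape[0]
    # Survey level: each component of the survey mean is an independent
    # normal regression on the survey covariates, as in (18.8).
    mu = X @ beta.T + rng.normal(size=(S, Q)) * np.sqrt(sigma2)
    # Individual level: responses are multivariate normal around the survey mean.
    y = [rng.multivariate_normal(mu[s], Psi, size=n_per_survey[s]) for s in range(S)]
    return mu, y


# Hypothetical example: 5 surveys, 2 covariates (intercept and date), 3 questions.
S = 5
X = np.column_stack([np.ones(S), np.linspace(0.0, 1.0, S)])
beta = np.array([[0.0, 1.0], [2.0, -0.5], [1.0, 0.0]])
sigma2 = np.array([0.1, 0.2, 0.05])
Psi = np.array([[1.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.0]])
mu, y = simulate_surveys(X, beta, sigma2, Psi, n_per_survey=[100] * S)
```

Setting some columns of a simulated survey to missing then mimics the 'not asked' pattern that the imputation model is designed to fill in.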

For mathematical convenience we use the following noninformative prior distribution for 
$(\Psi, \beta, \Sigma)$: 

$p(\Psi, \beta, \Sigma) = p(\Psi)\, p(\beta) \prod_{j=1}^{Q} p(\sigma_j^2) \propto |\Psi|^{-(Q+1)/2}$. (18.9) 

If there are fewer than Q individuals with responses on all the questions, then it is necessary 
to use a proper prior distribution for $\Psi$. 


Use of the continuous model for discrete responses 


There is a natural concern when using a continuous imputation model based on the normal 
distribution for survey responses, which are coded at varying levels of discretization. Some 
variables in our analysis (sex, ethnicity) are coded as unordered and discrete, others (voting 
intention, education) are ordered and discrete, whereas others (age, income, and the opinion 
questions on 1-5 and 1-7 scales) are potentially continuous, but are coded as ordered and 
discrete. We recode the responses from different survey organizations as appropriate so 
that the responses from each question fall on a common scale. (For example, for the 
surveys in which the ‘perceived ideology’ questions are framed as too-liberal/just-right /too- 
conservative, the responses are recoded based on the respondent’s stated ideology.) 

There are several ways to adapt a continuous model to impute discrete responses. From 
the most elaborate to the simplest, these include (1) modeling the discrete responses conditional 
on a latent continuous variable as described in Chapter 16 in the context of generalized 
linear models, (2) modeling the data as continuous and then using some approximate pro- 
cedure to impute discrete values for the missing responses, and (3) modeling the data as 
continuous and imputing continuous values. We follow the third, simplest approach. In our 
example, little is lost by this simplification, because the variables that are the most ‘discrete’ 
(sex, ethnicity, vote intention) are fully observed or nearly so, whereas the variables with 
the most nonresponse (the opinion questions) are essentially continuous. When discrete 
values are needed, we round off the continuous imputations, essentially using approach (2). 

Figure 18.1 Approximate monotone data pattern for 51 polls conducted during the 1988 U.S. 
presidential election campaign. Not all questions were asked in all surveys. 


Computation 


The model is complicated enough and the dataset large enough that it was necessary to 
write a special program to perform posterior simulations. A simple Gibbs or Metropolis 
algorithm would run too slowly to work here. Our method uses two basic steps: data 
augmentation to form a monotone missing data pattern (as described in Section 18.3), and 
the Gibbs sampler to draw simulations from the joint posterior distribution of the missing 
data and parameters under the monotone pattern. 

Monotone data augmentation is effective for multiple imputation for multiple surveys 
because the data can be sorted so that a large portion of the missing values fall into a 
monotone pattern due to the fact that some questions are not asked in some surveys. Figure 
18.1 illustrates a constructed monotone data pattern for the pre-election surveys, with the 
variables arranged in decreasing order of proportion of missing data. The most observed 
variable is sex, and the least observed concern the candidates’ perceived ideologies. Let k 
index the Q possible observed data patterns, $n_k$ be the number of respondents with that 
pattern, and $s_i$ the survey containing the ith respondent in the kth observed data pattern. 
The resulting data matrix can be written as 


$y_{\rm mp} = ((y_{s_i i 1}, \ldots, y_{s_i i k}) : i = 1, \ldots, n_k,\ k = 1, \ldots, Q)$, (18.10) 


where the subscript ‘mp’ denotes that this is the data matrix for the monotone pattern. The 
monotone pattern data matrix in (18.10) still contains missing values; that is, some of the 
values in the bottom right part of Figure 18.1 are not observed even though the question 
was asked of the respondent. We denote by $y_{\rm mp,mis}$ the set of all the missing values in 
(18.10) and by $y_{\rm obs}$ the set of all the observed values. Thus, we have $y_{\rm mp} = (y_{\rm obs}, y_{\rm mp,mis})$. 

Using the data in the form (18.10) and the model (18.6), (18.7), and (18.9), we have 
observed data $y_{\rm obs}$ and unknowns $(y_{\rm mp,mis}, \Psi, \mu_1, \ldots, \mu_S, \beta, \Sigma)$. To take draws from the 
posterior distribution of $(y_{\rm mp,mis}, \Psi, \mu_1, \ldots, \mu_S, \beta, \Sigma)$ given observed data $y_{\rm obs}$, we use a 
version of the monotone Gibbs sampler where each iteration consists of the following three 
steps: 


1. Impute $y_{\rm mp,mis}$ given $\Psi, \mu_1, \ldots, \mu_S, \beta, \Sigma, y_{\rm obs}$. Under the normal model these imputations 
require a series of draws from the joint normal distribution for survey responses. 


2. Draw $(\Psi, \beta, \Sigma)$ given $\mu_1, \ldots, \mu_S, y_{\rm obs}, y_{\rm mp,mis}$. These are parameters in a hierarchical 
linear model with ‘complete’ (including the imputations) monotone pattern. 


3. Draw $(\mu_1, \ldots, \mu_S)$ given $\Psi, \beta, \Sigma, y_{\rm obs}, y_{\rm mp,mis}$. The parameters $\mu_1, \ldots, \mu_S$ are mutually 
independent and normally distributed in this conditional posterior distribution. 
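
The imputation in step 1 rests on a standard fact: for a multivariate normal vector, the missing components given the observed ones are again normal. The Python sketch below illustrates that single building block for one respondent. It is not the special-purpose program used for the analysis; the function name, the use of NaN to mark missing entries, and the numbers are assumptions made for illustration.

```python
import numpy as np


def impute_missing(y, mu, Psi, rng):
    """Replace NaN entries of one response vector by a draw from the
    conditional normal distribution given the observed entries."""
    y = np.asarray(y, dtype=float).copy()
    mis = np.isnan(y)
    if not mis.any():
        return y
    obs = ~mis
    if not obs.any():
        # Nothing observed: draw the whole vector from N(mu, Psi).
        y[:] = rng.multivariate_normal(mu, Psi)
        return y
    Psi_oo = Psi[np.ix_(obs, obs)]
    Psi_mo = Psi[np.ix_(mis, obs)]
    Psi_mm = Psi[np.ix_(mis, mis)]
    # Regression coefficients of the missing on the observed components:
    # Psi_mo Psi_oo^{-1}, computed with a linear solve rather than an inverse.
    coef = np.linalg.solve(Psi_oo, Psi_mo.T).T
    cond_mean = mu[mis] + coef @ (y[obs] - mu[obs])
    cond_cov = Psi_mm - coef @ Psi_mo.T
    cond_cov = (cond_cov + cond_cov.T) / 2  # enforce symmetry numerically
    y[mis] = rng.multivariate_normal(cond_mean, cond_cov)
    return y


# Hypothetical respondent with the second of three questions unanswered.
rng = np.random.default_rng(0)
mu_s = np.zeros(3)  # stands in for the mean vector of the respondent's survey
Psi = np.array([[1.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.0]])
completed = impute_missing([1.0, np.nan, 0.5], mu_s, Psi, rng)
```

In a full sampler this draw would be repeated for every respondent at every iteration, using the current values of the survey mean and the variance matrix.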


Accounting for survey design and weights 


The respondents for each survey were assigned sampling and poststratification weights which 
we did not use in the imputation procedure because the variables on which the weights were 
based were, by and large, included in the model already. Thus, the weights provide no extra 
information about the missing responses. 

We do, however, use the survey weights when computing averages based on the imputed 
data, in order to obtain unbiased estimates of population averages unconditional on the 
demographic variables. 


Results 


We examine the results of the imputation for the 51 surveys with a separate graph for each 
variable; we illustrate in Figure 18.2 with two of the variables: ‘income’ (chosen because it 
should remain stable during the four-month period under study) and ‘perceived ideology of 
Dukakis’ (which we expect to change during the campaign). 

First consider Figure 18.2a, ‘income.’ Each symbol on the graph represents a different 
survey, plotting the estimated average income for the survey vs. the date of the survey, with 
the symbol itself indicating the survey organization. The size of the symbol is proportional 
to the fraction of survey respondents who responded to the particular question, with the 
convention that when the question is not asked (indicated by circled symbols on the graph), 
the symbol is tiny but not of zero size. The vertical bars show ±1 standard error in the 
posterior mean of the average income, where the standard error includes within-imputation 
sampling variance and between-imputation variance. Finally, the inner brackets on the 
vertical bars show the within-imputation standard deviation alone. All of the complete- 
data means and standard deviations displayed in the figure are weighted as described above. 
Even those surveys in which the question was asked (and answered by most respondents) 
have nonzero standard errors for the estimated population average income because of the 
finite sample sizes of the surveys. For surveys in which the income question was asked, 
the within-imputation variance almost equals the total variance, which makes sense since, 
when a question was asked, most respondents answered. The multiple imputation procedure 
makes weak statements (that is, there are large standard errors) about missing income 
responses for those surveys missing the income question. This makes sense because income 
is not highly correlated with the other questions. 
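
The two kinds of bars can be computed from a set of completed datasets. The Python sketch below assumes the standard multiple-imputation combining rule, in which the total variance is the average within-imputation variance plus (1 + 1/K) times the between-imputation variance for K imputations; the exact computation behind the figure is not restated here, and the function name and numbers are hypothetical.

```python
import numpy as np


def combine_imputations(estimates, within_variances):
    """Combine K completed-data estimates of a survey mean.

    estimates: the K completed-data (weighted) means;
    within_variances: their K completed-data sampling variances.
    Returns the point estimate, the within-imputation standard deviation
    (inner brackets), and the total standard error (full vertical bar).
    """
    estimates = np.asarray(estimates, dtype=float)
    within_variances = np.asarray(within_variances, dtype=float)
    K = len(estimates)
    point = estimates.mean()
    W = within_variances.mean()      # within-imputation variance
    B = estimates.var(ddof=1)        # between-imputation variance
    total = W + (1 + 1 / K) * B
    return point, np.sqrt(W), np.sqrt(total)


# Hypothetical average incomes (in thousands) from K = 3 imputed datasets.
point, within_sd, total_se = combine_imputations([34.0, 35.0, 36.0], [1.0, 1.0, 1.0])
```

When a question was asked and mostly answered, the imputed datasets nearly agree, B is close to zero, and the two bars almost coincide, which is the pattern described above.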

Figure 18.2a also shows some between-survey variability in average income, from $31,000 
to $37,000—more than can be explained by sampling variability, as is indicated by the error 
bars on the surveys for which the question was asked. Since we do not believe that the 
average income among the population of registered or likely voters is changing that much, 
the explanation must lie in the surveys. In fact, different pollsters use different codings for 
incomes (for example, 0-10K, 10-20K, 20-30K, ..., or 0-7.5K, 7.5-15K, 15-25K, ...). Since 
we seek to impute what the surveys would look like if all the questions had been asked and 
answered, rather than to adjust all the observed and unobserved data to estimate population 
quantities, this variability is reasonable. The large error bars for average income for the 
surveys in which the question was not asked reflect the large between-survey variation 
in average income, which is captured by our hierarchical model. For this study, we are 
interested in income as a predictor variable rather than for its own sake, and we are willing 
to accept this level of uncertainty. 

[Figure 18.2, panel titles: (a) The posterior means of ‘income’ in the 51 surveys; (b) The 
posterior means of ‘Dukakis’s perceived ideology’ in the 51 surveys] 

Figure 18.2 Estimates and standard error bars for the population mean response for two questions: 
(a) income (in thousands of dollars), and (b) perceived ideology of Dukakis (on a 1-7 scale from 
liberal to conservative), over time. Each symbol represents a different survey, with different letters 
indicating different survey organizations. The size of the letter indicates the number of responses 
to the question, with large-sized letters for surveys with nearly complete response and small-sized 
letters for surveys with few responses. Circled letters indicate surveys for which the question was 
not asked; the estimates for these surveys have much larger standard errors. The inner brackets on 
the vertical bars show the within-imputation standard deviation for the average from each poll. 

Figure 18.2b shows a similar plot for Dukakis’s perceived ideology. Both the observed 
and imputed survey responses show a time trend in this variable—Dukakis was perceived 
as more liberal near the end of the campaign. In an earlier version of the model (not shown 
here) that did not include time as a predictor, the plot of Dukakis’s perceived ideology 
showed a problem: the observed data showed the time trend, but the imputations did not. 
This plot was useful in revealing the model flaw: the imputed survey responses looked 
systematically different from the data. 


18.5 Missing values with counted data 


We discuss fully observed counted data in Section 3.4 for saturated multinomial models and 
in Chapter 16 for loglinear models. Here we consider how those techniques can be applied 
to missing discrete data problems. 


Multinomial samples. Model the complete data as a multinomial sample of size n with 
cells $c_1, \ldots, c_J$, probabilities $\theta = (\pi_1, \ldots, \pi_J)$, and cell counts $n_1, \ldots, n_J$. Conjugate prior 
distributions for $\theta$ are in the Dirichlet family (see Appendix A and Section 3.4). The 
observed data are m completely classified observations with $m_j$ in cell j, and $n - m$ partially 
classified observations (the missing data). The partially classified observations are known 
to fall in subsets of the J cells. For example, in a 2 x 2 x 2 table, J = 8, and an observation 
with known classification for the first two dimensions but with missing classification for the 
third dimension is known to fall in one of two possible cells. 


Partially classified observations. It is convenient to organize each of the $n - m$ partially 
classified observations according to the subset of cells to which it can belong. Suppose there 
are K types of partially classified observation, with subset $S_k$ containing the $r_k$ observations 
of the kth type. 

The iterative procedure used for normal data in the previous subsection, data augmentation, 
can be used here to iterate between imputing cells for the partially classified 
observations and obtaining draws from the posterior distribution of the parameters $\theta$. The 
imputation step draws from the conditional distribution of the partially classified cells given 
the observed data and the current set of parameters $\theta$. For each $k = 1, \ldots, K$, the $r_k$ partially 
classified observations known to fall in the subset of cells $S_k$ are assigned randomly 
to each of the cells in $S_k$ with probability 


$\frac{\pi_j I_{j \in S_k}}{\sum_{l=1}^{J} \pi_l I_{l \in S_k}}$, 


where $I_{j \in S_k}$ is the indicator equal to 1 if cell j is part of the subset $S_k$ and 0 otherwise. 
When the prior distribution is Dirichlet, then the parameter updating step requires drawing 
from the conjugate Dirichlet posterior distribution, treating the imputed data and the 
observed data as a complete dataset. As usual, it is possible to use other, nonconjugate, 
prior distributions, although this makes the posterior computation more difficult. The EM 
algorithm for exploring modes is a nonstochastic version of data augmentation: the E-step 
computes the number of the $r_k$ partially classified observations that are expected to fall in 
each cell (the mean of the multinomial distribution), and the M-step computes updated cell 
probability estimates by combining the observed cell counts with the results of the E-step. 

As usual, the analysis is simplified when the missing-data pattern is monotone or nearly 
monotone, so that the likelihood can be written as a product of the marginal distribution of 
the most observed set of variables and a set of conditional distributions for each subsequent 
set of variables conditional on all of the preceding, more observed variables. If the prior 
density is also factored, for example, as a product of Dirichlet densities for the parameters 
of each factor in the likelihood, then the posterior distribution can be drawn from directly. 
The analysis of nearly monotone data requires iterating two steps: imputing values for those 
partially classified observations required to complete the monotone pattern, and drawing 
from the posterior distribution, which can be done directly for the monotone pattern. 

Further complications arise when the cell probabilities $\theta$ are modeled, as in loglinear 
models; see the references at the end of this chapter. 


                                   Independence 
Secession     Attendance      Yes      No    Don’t know 
Yes           Yes            1191       8        21 
              No                8       0         4 
              Don’t know      107       3         9 
No            Yes             158      68        29 
              No                7      14         3 
              Don’t know       18      43        31 
Don’t know    Yes              90       2       109 
              No                1       2        25 
              Don’t know       19       8        96 


Table 18.1 3 x 3 x 3 table of results of 1990 preplebiscite survey in Slovenia, from Rubin, Stern, 
and Vehovar (1995). We treat ‘don’t know’ responses as missing data. Of most interest is the 
proportion of the electorate whose ‘true’ answers are ‘yes’ on both ‘independence’ and ‘attendance.’ 


18.6 Example: an opinion poll in Slovenia 


We illustrate the methods described in the previous section with the analysis of an opinion 
poll concerning independence in Slovenia, formerly a province of Yugoslavia and now a 
nation. In 1990, a plebiscite was held in Slovenia at which the adult citizens voted on 
the question of independence. The rules of the plebiscite were such that nonattendance, 
as determined by an independent and accurate census, was equivalent to voting ‘no’; only 
those attending and voting ‘yes’ would be counted as being in favor of independence. In 
anticipation of the plebiscite, a Slovenian public opinion survey had been conducted that 
included several questions concerning likely plebiscite attendance and voting. In that survey, 
2074 Slovenians were asked three questions: (1) Are you in favor of independence?, (2) Are 
you in favor of secession?, and (3) Will you attend the plebiscite? The results of the survey 
are displayed in Table 18.1. Let a represent the estimand of interest from the sample survey, 
the proportion of the population planning to attend and vote in favor of independence. It 
follows from the rules of the plebiscite that ‘don’t know’ (DK) can be viewed as missing data 
(at least accepting that ‘yes’ and ‘no’ responses to the survey are accurate for the plebiscite). 
Every survey participant will vote yes or no—perhaps directly or perhaps indirectly by not 
attending. 

Why ask three questions when we only care about two of them? The response to question 
2 is not directly relevant but helps us more accurately impute the missing data. The 
survey participants may provide some information about their intentions by their answers 
to question 2; for example, a ‘yes’ response to question 2 might be indicative of likely 
support for independence for a person who did not answer question 1. 


Crude estimates 


As an initial estimate of a, the proportion planning to attend and vote ‘yes,’ we ignore the 
DK responses for these two questions; considering only the ‘available cases’ (those answering 
the attendance and independence questions) yields a crude estimate â = 1439/1549 = 0.93, 
which seems to suggest that the outcome of the plebiscite is not in doubt. However, given 
that only 1439/2074, or 69%, of the survey participants definitely plan to attend and vote 
‘yes,’ and given the importance of the outcome, improved inference is desirable, especially 
considering that if we were to assume that the DK responses correspond to ‘no,’ we would 
obtain a much different estimate. 
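
The arithmetic behind these crude estimates can be checked directly from Table 18.1. The Python sketch below transcribes the table into an array and recomputes both proportions; the array layout and the variable names are choices made for this illustration.

```python
import numpy as np

# Table 18.1 as counts[secession, attendance, independence],
# with index 0 = yes, 1 = no, 2 = don't know.
counts = np.array([
    [[1191, 8, 21], [8, 0, 4], [107, 3, 9]],
    [[158, 68, 29], [7, 14, 3], [18, 43, 31]],
    [[90, 2, 109], [1, 2, 25], [19, 8, 96]],
])
YES = 0

# Collapse over the secession question: rows = attendance, columns = independence.
by_attendance_independence = counts.sum(axis=0)

attend_and_yes = by_attendance_independence[YES, YES]        # 1439
available_cases = by_attendance_independence[:2, :2].sum()   # 1549, no DK on either question
total = counts.sum()                                         # 2074

crude_available = attend_and_yes / available_cases   # DK responses ignored
crude_dk_as_no = attend_and_yes / total              # DK responses counted as 'no'
```

The two results, about 0.93 and 0.69, bracket the range that a more careful treatment of the 'don't know' responses has to resolve.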

The 2074 responses include those of substitutes for original survey participants who 
could not be contacted after several attempts. Although the substitutes were chosen from 
the same clusters as the original participants to minimize differences between substitutes 
and nonsubstitutes, there may be some concern about differences between the two groups. 
We indicate the most pessimistic estimate for a by noting that only 1251/2074 of the original 
intended survey sample (using information not included in the table) plan to attend and 
vote ‘yes.’ For simplicity, we treat substitutes as original respondents for the remainder of 
this section and ignore the effects of clustering. 


The likelihood and prior distribution 


The complete data can be viewed as a sample of 2074 observations from a multinomial 
distribution on the eight cells of the 2 x 2 x 2 contingency table, with corresponding vector 
of probabilities $\theta$; the DK responses are treated as missing data. We use $\theta_{ijk}$ to indicate 
the probability of the multinomial cell in which the respondent gave answer i to question 
1, answer j to question 2, and answer k to question 3, with $i, j, k = 0$ for ‘no’ and 1 
for ‘yes.’ The estimand of most interest, a, is the sum of the appropriate elements of $\theta$, 
$a = \theta_{101} + \theta_{111}$. In our general notation, $y_{\rm obs}$ are the observed ‘yes’ and ‘no’ responses, 
and $y_{\rm mis}$ are the ‘true’ ‘yes’ and ‘no’ responses corresponding to the DK responses. The 
‘complete data’ form a 2074 x 3 matrix of 0’s and 1’s that can be recoded as a contingency 
table of 2074 counts in eight cells. 


The Dirichlet prior distribution for $\theta$ with parameters all equal to zero is noninformative 
in the sense that the posterior mean is the maximum likelihood estimate. Since one 
of the observed cell counts is 0 (‘yes’ on secession, ‘no’ on attendance, ‘no’ on indepen- 
dence), the improper prior distribution does not lead to a proper posterior distribution. 
It would be possible to proceed with the improper prior density if we thought of this cell 
as being a structural zero—a cell for which a nonzero count is impossible. The assump- 
tion of a structural zero does not seem particularly plausible here, and we choose to use a 
Dirichlet distribution with parameters all equal to 0.1 as a convenient (though arbitrary) 
way of obtaining a proper posterior distribution while retaining a diffuse prior distribution. 
A thorough analysis should explore the sensitivity of conclusions to this choice of prior 
distribution. 


The model for the ‘missing data’ 


We treat the DK responses as missing values, each known only to belong to some subset of 
the eight cells. Let $n = (n_{ijk})$ represent the hypothetical complete data and let $m = (m_{ijk})$ 
represent the number of completely classified respondents in each cell. There are 18 types 
of partially classified observations; for example, those answering ‘yes’ to questions 1 and 
2 and DK to question 3, those answering ‘no’ to question 1 and DK to questions 2 and 
3, and so on. Let $r_p$ denote the number of partially classified observations of type p; for 
example, let $r_1$ represent the number of those answering ‘yes’ to questions 1 and 2 and DK 
to question 3. Let $S_p$ denote the set of cells to which the partially classified observations of 
the pth type might belong; for example, $S_1$ includes cells 111 and 110. We assume that the 
DK responses are missing at random, which implies that the probability of a DK response 
may depend on the answers to other questions but not on the unobserved response to the 
question at hand. 


The complete-data likelihood is $p(n | \theta) \propto \prod_{i=0}^{1} \prod_{j=0}^{1} \prod_{k=0}^{1} \theta_{ijk}^{n_{ijk}}$, with complete-data 
sufficient statistics $n = (n_{ijk})$. If we let $\pi_{ijk\,p}$ represent the probability that a partially 
classified observation of the pth type belongs in cell ijk, then the MAR model implies 
that, given a set of parameter values $\theta$, the distribution of the $r_p$ observations with the pth 
missing-data pattern is multinomial with probabilities 


$\pi_{ijk\,p} = \frac{\theta_{ijk} I_{ijk \in S_p}}{\sum_{i'j'k'} \theta_{i'j'k'} I_{i'j'k' \in S_p}}$, 


where the indicator $I_{ijk \in S_p}$ is 1 if cell ijk is in subset $S_p$ and 0 otherwise. 


Using the EM algorithm to find the posterior mode of $\theta$ 


The EM algorithm in this case finds the mode of the posterior distribution of the multinomial 
probability vector $\theta$ by averaging over the missing data (the DK responses) and is especially 
easy here with the assumption of distinct parameters. For the multinomial distribution, 
the E-step, computing the expected complete-data log posterior density, is equivalent to 
computing the expected counts in each cell of the contingency table given the current 
parameter estimates. The expected count in each cell consists of the fully observed cases in 
the cell and the expected number of the partially observed cases that fall in the cell. Under 
the missing at random assumption and the resulting multinomial distribution, the DK 
responses in each category (that is, the pattern of missing data) are allocated to the possible 
cells in proportion to the current estimate of the model parameters. Mathematically, given 
current parameter estimate $\theta^{\rm old}$, the E-step computes 


$n_{ijk}^{\rm old} = {\rm E}(n_{ijk} | m, r, \theta^{\rm old}) = m_{ijk} + \sum_p r_p \pi_{ijk\,p}$. 


The M-step computes new parameter estimates based on the latest estimates of the 
expected counts in each cell; for the saturated multinomial model here (with distinct 
parameters), $\theta_{ijk}^{\rm new} = (n_{ijk}^{\rm old} + 0.1)/(n + 0.8)$. The EM algorithm is considered to converge when 
none of the parameter estimates changes by more than a tolerance criterion, which we set 
here to the unnecessarily low value of $10^{-18}$. The posterior mode of a is 0.882, which turned 
out to be close to the eventual outcome in the plebiscite. 
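
The E-step and M-step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program used for the analysis: the array layout (secession, attendance, independence), the uniform starting point, the tolerance, and the treatment of respondents who answered DK to all three questions (here allocated across all eight cells) are choices made for this sketch.

```python
import numpy as np

# Table 18.1 as COUNTS[secession, attendance, independence],
# with index 0 = yes, 1 = no, 2 = don't know.
COUNTS = np.array([
    [[1191, 8, 21], [8, 0, 4], [107, 3, 9]],
    [[158, 68, 29], [7, 14, 3], [18, 43, 31]],
    [[90, 2, 109], [1, 2, 25], [19, 8, 96]],
], dtype=float)


def compatible_cells(response):
    """Boolean 2x2x2 mask of the yes/no cells consistent with a response pattern."""
    mask = np.ones((2, 2, 2), dtype=bool)
    for axis, answer in enumerate(response):
        if answer < 2:  # a yes/no answer rules out the opposite value
            index = [slice(None)] * 3
            index[axis] = 1 - answer
            mask[tuple(index)] = False
    return mask


def expected_counts(theta):
    """E-step: allocate each response pattern to its compatible cells
    in proportion to the current cell probabilities."""
    expected = np.zeros((2, 2, 2))
    for response in np.ndindex(3, 3, 3):
        weights = theta * compatible_cells(response)
        expected += COUNTS[response] * weights / weights.sum()
    return expected


def em_posterior_mode(prior=0.1, tol=1e-12, max_iter=10_000):
    n = COUNTS.sum()
    theta = np.full((2, 2, 2), 1 / 8)  # uniform start (an assumption)
    for _ in range(max_iter):
        # M-step as given in the text: (expected count + 0.1) / (n + 0.8).
        new_theta = (expected_counts(theta) + prior) / (n + 8 * prior)
        converged = np.max(np.abs(new_theta - theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta


theta_hat = em_posterior_mode()
# Attend and vote 'yes' on independence, summed over the secession answer.
a_hat = theta_hat[:, 0, 0].sum()
```

Because the secession answer is kept in the table, a respondent who answered DK on independence is allocated using the cell probabilities conditional on that respondent's secession and attendance answers, which is how question 2 helps the imputation.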


Using SEM to estimate the posterior variance matrix and obtain a normal approximation 


To complete the normal approximation, we need estimates of the posterior variance matrix. 
The SEM algorithm numerically computes estimates of the variance matrix using the EM 
program and the complete-data variance matrix (which is available since the complete data 
are modeled as multinomial). 

The SEM algorithm is applied to the logit transformation of the components of $\theta$, since 
the normal approximation is generally more accurate on this scale. Posterior central 95% 
intervals for logit(a) are transformed back to yield a 95% interval for a, [0.857, 0.903]. The 
standard error was inflated to account for the design effect of the clustered sampling design 
using approximate methods based on the normal distribution. 


Multiple imputation using data augmentation 


Even though the sample size is large in this problem, it seems prudent, given the missing 
data, to perform posterior inference that does not rely on the asymptotic normality of the 
maximum likelihood estimates. The data augmentation algorithm, a special case of the 
Gibbs sampler, can be used to obtain draws from the posterior distribution of the cell 
probabilities $\theta$ under a noninformative Dirichlet prior distribution. As described earlier, for 
count data, the data augmentation algorithm iterates between imputations and posterior 
draws. At each imputation step, the $r_p$ cases with the pth missing-data pattern are allocated 
among the possible cells as a draw from a multinomial distribution. Conditional on these 
imputations, a draw from the posterior distribution of $\theta$ is obtained from the Dirichlet 
posterior distribution. A total of 1000 draws from the posterior distribution of $\theta$ were 
obtained: the second half of 20 data augmentation series, each run for 100 iterations, at 
which point the potential scale reductions, $\hat{R}$, were below 1.1 for all parameters. 
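
The alternation between imputation and posterior draws can be sketched as follows in Python. This is a rough illustration, not the code behind the reported results: the array layout, the random starting points, the Dirichlet(0.1) prior parameters, the number of series run here, and the allocation of all-DK respondents across all eight cells are assumptions of the sketch, and no convergence diagnostic is computed.

```python
import numpy as np

# Table 18.1 as COUNTS[secession, attendance, independence],
# with index 0 = yes, 1 = no, 2 = don't know.
COUNTS = np.array([
    [[1191, 8, 21], [8, 0, 4], [107, 3, 9]],
    [[158, 68, 29], [7, 14, 3], [18, 43, 31]],
    [[90, 2, 109], [1, 2, 25], [19, 8, 96]],
])


def compatible_cells(response):
    """Boolean 2x2x2 mask of the yes/no cells consistent with a response pattern."""
    mask = np.ones((2, 2, 2), dtype=bool)
    for axis, answer in enumerate(response):
        if answer < 2:
            index = [slice(None)] * 3
            index[axis] = 1 - answer
            mask[tuple(index)] = False
    return mask


def data_augmentation(n_iter=100, prior=0.1, seed=0):
    """One series; returns the draws of a (attend and vote 'yes')."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)  # random start
    draws = np.empty(n_iter)
    for t in range(n_iter):
        completed = np.zeros((2, 2, 2))
        for response in np.ndindex(3, 3, 3):
            mask = compatible_cells(response)
            if mask.sum() == 1:
                # Fully classified respondents need no imputation.
                completed[mask] += COUNTS[response]
                continue
            # Imputation step: multinomial allocation among compatible cells.
            probs = (theta * mask).ravel()
            probs = probs / probs.sum()
            completed += rng.multinomial(COUNTS[response], probs).reshape(2, 2, 2)
        # Posterior step: conjugate Dirichlet draw given the completed table.
        theta = rng.dirichlet(completed.ravel() + prior).reshape(2, 2, 2)
        draws[t] = theta[:, 0, 0].sum()
    return draws


# Four short series here for speed; keep the second half of each.
series = [data_augmentation(seed=s) for s in range(4)]
kept = np.concatenate([d[len(d) // 2:] for d in series])
posterior_median = np.median(kept)
```

Running several series from different starting points, as done here on a small scale, is what makes it possible to monitor convergence before pooling the retained draws.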


Posterior inference for the estimand of interest 


The posterior median of a, the population proportion planning to attend and vote yes, is 
0.883. We construct an approximate posterior central 95% interval for a by inflating the 
variance of the 95% interval from the posterior simulations to account for the clustering in 
the design (to avoid the complications but approximate the results of a full Bayesian analysis 
of this sampling design); the resulting interval is [0.859, 0.904]. It is not surprising, given 
the large sample size, that this interval matches the interval obtained from the asymptotic 
normal distribution. 

By comparison, neither of the initial crude calculations is close to the actual outcome, 
in which 88.5% of the eligible population attended and voted ‘yes.’ 


18.7 Bibliographic note 


The jargon ‘missing at random,’ ‘observed at random,’ and ‘ignorability’ originated with 
Rubin (1976). Skrondal and Rabe-Hesketh (2014) further classify missingness mechanisms 
in the context of hierarchical models. Multiple imputation was proposed in Rubin (1978b) 
and is discussed in detail in Rubin (1987a) and Rubin (1996) with a focus on sample 
surveys. Kish (1965) and Madow et al. (1983) discuss less formal ways of handling missing 
data in sample surveys. The book edited by Groves et al. (2002) reviews work on survey 
nonresponse. 

Little and Rubin (2002) is a comprehensive text on statistical analysis with missing data. 
Van Buuren (2012) is a recent text with a more computational focus. Tanner and Wong 
(1987) describe the use of data augmentation to calculate posterior distributions. Schafer 
(1997) and Liu (1995) apply data augmentation for multivariate exchangeable models, in- 
cluding the normal, t, and loglinear models discussed briefly in this chapter. The approx- 
imate variance estimate in Section 18.2 is derived from the Satterthwaite (1946) approxi- 
mation; see Rubin (1987a), Meng, Raghunathan, and Rubin (1991), and Meng and Rubin 
(1992). Abayomi, Gelman, and Levy (2008) discuss graphical methods for checking the fit 
and reasonableness of missing-data imputations. 

Meng (1994b) discusses the theory of multiple imputation when different models are 
used for imputation and analysis. Raghunathan et al. (2001) and Gelman and Raghu- 
nathan (2001) discuss the ‘inconsistent Gibbs’ method for multiple imputation. Van Buuren, 
Boshuizen, and Knook (1999) present a related approach; see Van Buuren and Oudshoorn 
(2000). Su et al. (2011) discuss further directions in this area. 

Clogg et al. (1991) and Belin et al. (1993) describe hierarchical logistic regression models 
used for imputation for the U.S. Census. There has been growing use of multiple imputation 
using nonignorable models for missing data; for example, Heitjan and Landis (1994) set up a 
model for unobserved medical outcomes and multiply impute using matching to appropriate 
observed cases. David et al. (1986) present a thorough discussion and comparison of a variety 
of imputation methods for a missing data problem in survey imputation. 

The monotone method for estimating multivariate models with missing data dates from 
Anderson (1957) and was extended by Rubin (1974a, 1976, 1987a); see the rejoinder to 
Gelman, King, and Liu (1998) for more discussion of computational details. The example 
of missing data in opinion polls comes from Gelman, King, and Liu (1998). The Slovenia 
survey is described in more detail in Rubin, Stern, and Vehovar (1995). 


18.8 Exercises 


1. Computation for discrete missing data: reproduce the results of Section 18.6 for the 2 x 2 
table involving independence and attendance. You can ignore the clustering in the survey 
and pretend it was obtained from a simple random sample. Specifically: 


(a) Use EM to obtain the posterior mode of $\theta$, the proportion who will attend and will 
vote ‘yes.’ 

(b) Use SEM to obtain the asymptotic posterior variance of logit($\theta$), and thereby obtain 
an approximate 95% interval for $\theta$. 

(c) Use Markov chain simulation of the parameters and missing data to obtain the 
approximate posterior distribution of $\theta$. Clearly state your starting distribution. Be sure to 
simulate more than one sequence and to include some diagnostics on the convergence 
of the sequences. 

2. Monotone missing data: create a monotone pattern of missing data for the opinion poll 
data of Section 18.6 by discarding some observations. Compare the results of analyzing 
these data with the results given in that section. 

3. Practical missing-data imputation: create a miniature version of the 2010 General Social 
Survey (publicly available on the Internet), including the following variables: sex, age, 
ethnicity (use four categories), urban/suburban/rural, education (use five categories), 
political ideology (on a 7-point scale from ‘extremely liberal’ to ‘extremely conservative’), 
and general happiness. 

(a) Using just the complete cases, fit a logistic regression on whether respondents feel ‘not 
too happy.’ 

(b) Impute the missing values using mi() in the mi package in R. Then take one of the 
completed datasets and fit a logistic regression as above. 

(c) Repeat, this time imputing using aregImpute() in the Hmisc package. 

(d) Repeat, this time imputing using mice() in the mice package. 

(e) Briefly discuss the differences between the four inferences above. 


Part V: Nonlinear and Nonparametric Models 


We conclude with discussion and examples of more general models: parametric nonlinear 
models in Chapter 19 and then nonparametric models in Chapters 20-23. Parametric 
nonlinear models have some prespecified functional form with unknown parameters. Non- 
parametric models are characterized by having unknown functions without prespecified form 
as part of the model specification. Nonparametric refers to similarity with non-Bayesian 
nonparametric methods or to the fact that the parameters do not have the same sorts of 
interpretations as those in parametric models. 

In Chapters 20 and 21 we discuss nonparametric models that may have a finite but 
arbitrarily large number of parameters, allowing them to approximate any function to any 
desired degree of accuracy. In Chapters 22 and 23 we discuss infinite-dimensional generalizations 
in which the prior is defined directly in an infinite-dimensional parameter space or function space. 
Actual computation is still carried out with a finite-dimensional representation, but the number of 
parameters can increase as $n$ grows. 

The key defining property for a useful nonparametric Bayesian model is that the prior 
distribution for an unknown density has large support, which informally means that the prior 
can generate functions that are arbitrarily close to any function in a broad class. Consider 
the density estimation problem and let $f \sim \Pi$, with $\Pi$ denoting a prior over the set of 
all densities on the real line. Our notion of large support requires that the prior $\Pi$ assigns 
positive probability to arbitrarily small neighborhoods of the true density $f_0$ for a large class 
of true densities. If the prior assigns zero probability to a neighborhood of the true density 
$f_0$, then the posterior will also assign zero probability to the neighborhood. Hence, if the 
true $f_0$ is not in the support of the prior $\Pi$, then one will obtain inconsistent estimates of 
the density and finite sample inferences may be misleading. It follows logically that, in the 
absence of substantial prior knowledge about the shape of the density, one should ideally 
choose the prior $\Pi$ so that all $f_0$ (or perhaps a large subset discarding irregular densities) 
fall in the support of $\Pi$. 


Chapter 19 


Parametric nonlinear models 


The linear regression approach of Part IV suggests a presentation of statistical models in 
menu form, with a set of possible distributions for the response variable, a set of 
transformations to facilitate the use of those distributions, and the ability to include information 
in the form of linear predictors. In a generalized linear model, the expected value of $y$ is 
a nonlinear function of the linear predictor: $E(y|X,\beta) = g^{-1}(X\beta)$. Robust (Chapter 17) 
and mixture models (Chapter 22) generalize these by adding a latent (unobserved) mixture 
parameter for each data point. 

Generalized linear modeling is flexible and powerful, with coefficients $\beta$ being relatively 
easy to interpret, especially when comparing to each other (since they all act on the same 
linear predictor). However, not all phenomena behave linearly, even under transformation. 
This chapter considers the more general case where parameters and predictors do not 
combine linearly. Simple examples include a ratio such as $E(y) = \frac{a_1 + b_1 x_1}{a_2 + b_2 x_2}$, or a sum of nonlinear 
functions such as $E(y) = A_1 e^{-\alpha_1 x} + A_2 e^{-\alpha_2 x}$; see Exercise 19.3. 

The general principles of inference and computation can be directly applied to nonlinear 
models. We briefly consider the three steps of Bayesian data analysis: model building, com- 
putation, and model checking. The most flexible approaches to nonlinear modeling typically 
involve complicated relations between predictors and outcomes, but generally without the 
necessity for unusual probability distributions. Computation can present challenges, be- 
cause we cannot simply adapt linear regression computations, as was done in Chapter 16 
for generalized linear models. Model checking can sometimes be performed using residual 
plots, $\chi^2$ tests, and other existing summaries, but sometimes new graphs need to be created 
in the context of a particular model. 

In addition, new difficulties arise in interpretation of parameters that cannot simply be 
understood in terms of a linear predictor. A key step in any inference for a nonlinear model 
is to display the fitted nonlinear relation graphically. 

Because nonlinear models come in many flavors, there is no systematic menu of options 
to present. Generally each new modeling problem must be tackled afresh. We present two 
examples from our own applied research to give some sense of the possibilities. 


19.1 Example: serial dilution assay 


A common design for estimating the concentrations of compounds in biological samples is 
the serial dilution assay, in which measurements are taken at several different dilutions of 
a sample. The reason for the serial dilutions of each sample is that the concentration in an 
assay is quantified by an automated optical reading of a color change, and there is a limited 
range of concentrations for which the color change is informative: at low values, the color 
change is imperceptible, and at high values, the color saturates. Thus, several dilutions give 
several opportunities for an accurate measurement. 

More precisely, the dilutions give several measurements of differing accuracy, and a 
likelihood or Bayesian approach should allow us to combine the information in these 
measurements appropriately. An assay comes on a plate with a number of wells each containing 
a sample or a specified dilution of a sample. There are two sorts of samples: unknowns, 
which are the samples to be measured and their dilutions, and standards, which are 
dilutions of a known compound, used to calibrate the measurements. Figure 19.1 shows a 
typical plate with 96 wells; the first two columns of the plate contain the standard and its 
dilutions and the remaining ten columns are for unknown quantities. The dilution values 
for the unknown samples are more widely spaced than for the standards in order to cover 
a wider range of concentrations with a given number of assays. 


[Figure 19.1: plate layout. Each unknown column holds the dilutions 1, 1/3, 1/9, 1/27 in its top four wells and the same four dilutions again in its bottom four wells.] 

Figure 19.1 Typical setup of a plate with 96 wells for a serial dilution assay. The first two columns 
are dilutions of ‘standards’ with known concentrations, and the other columns are ten different 
‘unknowns.’ The goal of the assay is to estimate the concentrations of the unknowns, using the 
standards as calibration. 


[Figure 19.2: one large panel, ‘Standards data’ (y versus dilution of known compound), and ten small panels, ‘Unknown 1’ through ‘Unknown 10’ (y versus dilution).] 

Figure 19.2 Data from a single plate of a serial dilution assay. The large graph shows the calibration 
data, and the ten small graphs show data for the unknown compounds. The goal of the analysis 
is to figure out how to scale the x-axes of the unknowns so they line up with the curve estimated 
from the standards. (The curves shown on these graphs are estimated from the model as described 
in Section 19.1.) 


Laboratory data 


Figure 19.2 shows data from a single plate in a study of cockroach allergens in the homes 
of asthma sufferers. Each graph shows the optical measurements y versus the dilutions 
for a single compound along with an estimated curve describing the relationship between 
dilution and measurement. The estimation of the curves relating dilutions to measurements 
is described below. 


Standards data 

Conc.   Dilution   y 
0.64    1          101.8 
0.64    1          121.4 
0.32    1/2        105.2 
0.32    1/2        114.1 
0.16    1/4         92.7 
0.16    1/4         93.3 
0.08    1/8         72.4 
0.08    1/8         61.1 
0.04    1/16        57.6 
0.04    1/16        50.0 
0.02    1/32        38.5 
0.02    1/32        35.1 
0.01    1/64        26.6 
0.01    1/64        25.0 
0       0           14.7 
0       0           14.2 

Data from two of the unknown samples 

Sample      Dilution   y      Est. conc. 
Unknown 8   1          19.2   * 
            1          19.5   * 
            1/3        16.1   * 
            1/3        15.8   * 
            1/9        14.9   * 
            1/9        14.8   * 
            1/27       14.3   * 
            1/27       16.0   * 
Unknown 9   1          49.6   0.040 
            1          43.8   0.031 
            1/3        24.0   0.005 
            1/3        24.1   0.005 
            1/9        17.3   * 
            1/9        17.6   * 
            1/27       15.6   * 
            1/27       17.1   * 

Figure 19.3 Example of measurements y from a plate as analyzed by standard software used for 
dilution assays. The standards data are used to estimate the calibration curve, which is then used 
to estimate the unknown concentrations. The concentrations indicated by asterisks are labeled as 
‘below detection limit.’ However, information is present in these low observations, as can be seen 
by noting the decreasing pattern of the measurements from dilutions 1 to 1/3 to 1/9 in each sample. 


Figure 19.3 illustrates difficulties with a currently standard approach to estimating un- 
known concentrations. The left part of the figure shows the standards data (corresponding 
to the first graph in Figure 19.2): the two initial samples have known concentrations of 
0.64, with each followed by several dilutions and a zero measurement. The optical color 
measurements y start above 100 for the samples with concentration 0.64 and decrease to 
around 14 for the zero concentration (all inert compound) samples. The right part of Fig- 
ure 19.3 shows, for two of the ten unknowns on the plate, the color measurements y and 
corresponding concentration estimates from a standard method. 


All the estimates for Unknown 8 are shown by asterisks, indicating that they were 
recorded as ‘below detection limit,’ and the standard computer program for analyzing these 
data gives no estimate at all. A casual glance at the data (see the plot for Unknown 8 in 
Figure 19.2) might suggest that these data are indeed all noise, but a careful look at the 
numbers reveals that the measurements decline consistently from concentrations of 1 to 1/3 
to 1/9, with only the final dilutions apparently lost in the noise (in that the measurements at 
1/27 are no lower than at 1/9). A clear signal is present for the first six measurements of this 
unknown sample. 

Unknown 9 shows a better outcome, in which four of the eight measurements are above 
detection limit. The four more diluted measurements yield readings that are below the 
detection limit. Once again, however, information seems to be present in the lower mea- 
surements, which decline consistently with dilution. 


As can be seen in Figure 19.2, Unknowns 8 and 9 are not extreme cases but rather 
are somewhat typical of the data from this plate. In measurements of allergens, even 
low concentrations can be important, and we need to be able to distinguish between zero 
concentrations and values that are merely low. The Bayesian inference described here 
makes this distinction far more precisely than the previous method by which such data 
were analyzed. 


The model 


Notation. The parameters of interest in a study such as in Figures 19.1-19.3 are the 
concentrations of the unknown samples; we label these as $\theta_1,\ldots,\theta_{10}$ for the plate layout 
shown in Figure 19.1. The known concentration of the standard is denoted by $\theta_0$. We label 
the concentration in well $i$ as $x_i$ and use $y_i$ to denote the corresponding color intensity 
measurement, with $i = 1,\ldots,96$ for our plate. Each $x_i$ is a specified dilution of one of the 
samples. We set up the model for the color intensity observations $y$ in stages: a parametric 
model for the expected color intensity for a given concentration, measurement errors for 
the optical readings, errors introduced during the dilution preparation process, and, finally, 
prior distributions for all the parameters. 


Curve of expected measurements given concentration. We follow the usual practice in this 
field and fit the following four-parameter nonlinear model for the expected optical reading 
given concentration $x$: 

$$E(y|x,\beta) = g(x,\beta) = \beta_1 + \frac{\beta_2}{1 + (x/\beta_3)^{-\beta_4}}, \tag{19.1}$$

where $\beta_1$ is the color intensity in the limit of zero concentration, $\beta_2$ is the increase to 
saturation, $\beta_3$ is the concentration at which the gradient of the curve turns, and $\beta_4$ is the 
rate at which saturation occurs. All parameters are restricted to nonnegative values. This 
model is equivalent to a scaled and shifted logistic function of $\log(x)$. The model fits the 
data fairly well, as can be seen in Figure 19.2. 
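
As a minimal sketch of how the curve (19.1) behaves, the following Python function evaluates $g(x,\beta)$ at a few concentrations. The function name `expected_reading` is hypothetical, and the parameter values are the rounded posterior medians reported later in this section, used here only for illustration.

```python
def expected_reading(x, beta):
    """Four-parameter curve (19.1): beta1 + beta2 / (1 + (x / beta3) ** (-beta4))."""
    b1, b2, b3, b4 = beta
    if x == 0:
        # Limit of zero concentration: the reading is the baseline beta1.
        return b1
    return b1 + b2 / (1.0 + (x / b3) ** (-b4))


# Illustrative parameter values only (rounded posterior medians from the text).
beta = (14.7, 99.7, 0.054, 1.34)
for x in (0.0, 0.01, 0.054, 0.64):
    print(x, round(expected_reading(x, beta), 1))
```

At $x = \beta_3$ the curve sits exactly halfway between the baseline $\beta_1$ and the saturation level $\beta_1 + \beta_2$, which is one way to read the role of $\beta_3$.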


Measurement error. The measurement errors are modeled as normally distributed with 
unequal variances: 

$$y_i \sim N\left(g(x_i,\beta),\ \left(\frac{g(x_i,\beta)}{A}\right)^{2\alpha}\sigma_y^2\right), \tag{19.2}$$

where the parameter $\alpha$, which is restricted to lie between 0 and 1, models the pattern that 
variances are higher for larger measurements (for example, see Figure 19.2). The constant 
$A$ in (19.2) is arbitrary; we set it to the value 30, which is in the middle of the range of the 
data. It is included in the model so that the parameter $\sigma_y$ has a more direct interpretation 
as the error standard deviation for a ‘typical’ measurement. 

The model (19.2) reduces to an equal-variance normal model if $\alpha = 0$ and approximately 
corresponds to the equal-variance model on the log scale if $\alpha = 1$. Getting the variance 
relation correct is important here because many of our data are at low concentrations, and 
we want our model to use these measurements but not to overstate their precision. 
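
The variance relation in (19.2) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used for the analysis: the function names are hypothetical, and the values of $\sigma_y$ and $\alpha$ are chosen only to show how the error standard deviation grows with the expected reading.

```python
import random

A = 30.0  # arbitrary scaling constant from (19.2)


def reading_sd(g, sigma_y, alpha):
    """Standard deviation of y implied by (19.2): (g / A) ** alpha * sigma_y."""
    return (g / A) ** alpha * sigma_y


def simulate_reading(g, sigma_y, alpha, rng):
    """Draw one optical reading y ~ N(g, sd^2) for expected value g."""
    return rng.gauss(g, reading_sd(g, sigma_y, alpha))


rng = random.Random(0)
sigma_y, alpha = 2.2, 0.97  # illustrative values only
for g in (15.0, 30.0, 100.0):
    sd = reading_sd(g, sigma_y, alpha)
    print(g, round(sd, 2), round(simulate_reading(g, sigma_y, alpha, rng), 1))
```

With $\alpha = 0$ the standard deviation is $\sigma_y$ at every expected value, and with $\alpha = 1$ it is proportional to the expected value, which matches the two limiting cases described above.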


Dilution errors. The process introduces errors in two places: the initial dilution, in which 
a measured amount of the standard is mixed with a measured amount of an inert liquid; 
and serial dilutions, in which a sample is diluted by a fixed factor such as 2 or 3. For 
the cockroach allergen data, serial dilution errors were low, so we include only the initial 
dilution error in our model. 

We use a normal model on the (natural) log scale for the initial dilution error associated 
with preparing the standard sample. The known concentration of the standard solution 
is $\theta_0$, and $d_0^{\rm init}$ is the (known) initial dilution of the standard that is called for. Without 
dilution error, the concentration of the initial dilution would thus be $d_0^{\rm init}\theta_0$. Let $x_0^{\rm init}$ be 
the actual (unknown) concentration of the initial dilution, with 

$$\log(x_0^{\rm init}) \sim N\left(\log(d_0^{\rm init}\cdot\theta_0),\ (\sigma^{\rm init})^2\right). \tag{19.3}$$

For the unknowns, there is no initial dilution, and so the unknown initial concentration for 
sample $j$ is $x_j^{\rm init} = \theta_j$ for $j = 1,\ldots,10$. For the further dilutions of standards and unknowns, 
we simply set 

$$x_i = d_i \cdot x_{j(i)}^{\rm init}, \tag{19.4}$$

where $j(i)$ is the sample (0, 1, 2, ..., or 10) corresponding to observation $i$, and $d_i$ is the 
dilution of observation $i$ relative to the initial dilution. (The $d_i$’s are the numbers displayed 
in Figure 19.1.) The relation (19.4) reflects the assumption that serial dilution errors are 
low enough to be ignored. 
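
The two dilution steps can be sketched as follows. This is a minimal illustration, assuming hypothetical function names and hypothetical values of $\theta_0$ and $d_0^{\rm init}$ chosen so that their product is the 0.64 shown for the standards in Figure 19.3; only the 2% error scale is taken from the text.

```python
import math
import random


def initial_concentration(theta0, d0_init, sigma_init, rng):
    """Draw the actual concentration of the initial standard dilution, as in (19.3)."""
    log_x0 = rng.gauss(math.log(d0_init * theta0), sigma_init)
    return math.exp(log_x0)


def well_concentrations(x_init, dilutions):
    """Apply (19.4): x_i = d_i * x_init, ignoring serial dilution error."""
    return [d * x_init for d in dilutions]


rng = random.Random(1)
# Hypothetical inputs: theta0 and d0_init are chosen so that d0_init * theta0 = 0.64.
x0 = initial_concentration(theta0=0.64, d0_init=1.0, sigma_init=0.02, rng=rng)
standard_dilutions = [1, 1 / 2, 1 / 4, 1 / 8, 1 / 16, 1 / 32, 1 / 64, 0]
print([round(x, 4) for x in well_concentrations(x0, standard_dilutions)])
```

Because the initial dilution error enters once and is then multiplied through (19.4), every well of the standard shares the same proportional error.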


Prior distributions. We assign noninformative uniform prior distributions to the 
parameters of the calibration curve: $\log(\beta_k) \sim U(-\infty,\infty)$ for $k = 1,\ldots,4$; $\sigma_y \sim U(0,\infty)$; 
$\alpha \sim U(0,1)$. A design such as displayed in Figure 19.1 with lots of standards data allows 
us to estimate all these parameters fairly accurately. We also assign noninformative prior 
distributions for the unknown concentrations: $p(\log\theta_j) \propto 1$ for each unknown $j = 1,\ldots,10$. 
Another option would be to fit a hierarchical model of the form $\log\theta_j \sim N(\mu_\theta, \sigma_\theta^2)$ (or, 
better still, a mixture model that includes the possibility that some true concentrations $\theta_j$ 
are zero), but for simplicity we use a no-pooling model (corresponding to $\sigma_\theta = \infty$) in this 
analysis. 


There is one parameter in the model, $\sigma^{\rm init}$ (the scale of the initial dilution error), that 
cannot be estimated from a single plate. For our analysis, we fix it at the value 0.02 (that is, 
an initial dilution error with standard deviation 2%), which was obtained from a previous 
analysis of data from plates with several different initial dilutions of the standard. 


Inference 


We fitted the model using the Bugs package (a predecessor to the Stan software described in 
Appendix C). We obtained approximate convergence (the potential scale reduction factors 
$\widehat{R}$ were below 1.1 for all parameters) after 50,000 iterations of two parallel chains of the 
Gibbs sampler. To save memory and computation time in processing the simulations, 
we saved every 20th iteration of each chain. When fitting the model, it is helpful to use 
reasonable starting points (which can be obtained using crude estimates from the data) 
and to parameterize in terms of the logarithms of the parameters $\beta_k$ and the unknown 
concentrations $\theta_j$. To speed convergence, we actually worked with the parameters $\log\beta$, $\alpha$, $\sigma_y$, 
and $\log\gamma$, where $\log(\gamma_j) = \log(\theta_j/\beta_3)$. The use of $\gamma_j$ in place of $\theta_j$ gets around the problem 
of the strong posterior correlation between the unknown concentrations and the parameter 
$\beta_3$, which indexes the x-position of the calibration curve (see (19.1)). 
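
The convergence check and the thinning step can be sketched in Python. This is an illustration only: it computes the potential scale reduction factor on split chains (one common variant, assumed here), the function name `split_rhat` is hypothetical, and the 'chains' are independent normal draws standing in for Gibbs sampler output.

```python
import random


def _mean(v):
    return sum(v) / len(v)


def _var(v):
    m = _mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)


def split_rhat(chains):
    """Potential scale reduction factor computed after splitting each chain in half."""
    half = min(len(c) for c in chains) // 2
    halves = []
    for c in chains:
        halves.append(c[:half])
        halves.append(c[half:2 * half])
    n = half
    B = n * _var([_mean(h) for h in halves])   # between-sequence variance
    W = _mean([_var(h) for h in halves])       # within-sequence variance
    var_plus = (n - 1) / n * W + B / n         # pooled estimate of the posterior variance
    return (var_plus / W) ** 0.5


rng = random.Random(0)
# Stand-in draws: two chains sampling the same distribution, and two that disagree.
mixed = [[rng.gauss(0, 1) for _ in range(2000)] for _ in range(2)]
stuck = [[rng.gauss(0, 1) for _ in range(2000)],
         [rng.gauss(5, 1) for _ in range(2000)]]
thinned = [c[::20] for c in mixed]  # keep every 20th draw, as in the text
print(round(split_rhat(mixed), 3), round(split_rhat(thinned), 3), round(split_rhat(stuck), 3))
```

Chains that sample the same distribution give a value near 1, while chains centered in different places give a value well above the 1.1 threshold used in the text.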


The posterior median estimates (and posterior 50% intervals) for the parameters of the 
calibration curve are $\hat\beta_1 = 14.7$ [14.5, 14.9], $\hat\beta_2 = 99.7$ [96.8, 102.9], $\hat\beta_3 = 0.054$ [0.051, 0.058], 
and $\hat\beta_4 = 1.34$ [1.30, 1.38]. The posterior median estimate of $\beta$ defines a curve $g(x, \hat\beta)$ which 
is displayed in the upper-left plot of Figure 19.2. As expected, the curve goes through the 
data used to estimate it. The variance parameters $\sigma_y$ and $\alpha$ are estimated as 2.2 and 0.97 
(with 50% intervals of [2.1, 2.3] and [0.94, 0.99], respectively). The high precision of the 
measurements (as can be seen from the replicates in Figure 19.2) allowed the parameters 
to be accurately estimated from a relatively small dataset. 


Figure 19.4 displays the inferences for the concentrations of the 10 unknown samples. 
We used these estimates, along with the estimated calibration curve, to draw scaled curves 
for each of the 10 unknowns displayed in Figure 19.2. Finally, Figure 19.5 displays the 
residuals, which seem generally reasonable. 


Figure 19.4 Posterior medians, 50% intervals, and 95% intervals for the concentrations of the 10 
unknowns for the data displayed in Figure 19.2. Estimates are obtained for all the samples, even 
Unknown 8, all of whose data were ‘below detection limit’ (see Figure 19.3). 

Figure 19.5 Standardized residuals $(y_i - E(y_i|x_i))/{\rm sd}(y_i|x_i)$ vs. expected values $E(y_i|x_i)$, for the 
model fit to standards and unknown data from a single plate. Circles and crosses indicate 
measurements from standards and unknowns, respectively. No major problems appear with the model 
fit. 


Comparison to existing estimates 


The method that is standard practice in the field involves first estimating the calibration 
curve and then transforming each measurement from the unknown samples directly to 
an estimated concentration, by inverting the fitted calibration curve. For each unknown 
sample, the estimated concentrations are then divided by their dilutions and averaged to 
obtain a single estimate. (For example, using this approach, the estimated concentration for 
Unknown 9 from the data displayed in Figure 19.3 is $\frac{1}{4}(0.040 + 0.031 + 3 \cdot 0.005 + 3 \cdot 0.005) = 
0.025$.) 
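
The two steps of the standard method, inverting the curve and averaging the dilution-corrected estimates, can be sketched as below. The function names are hypothetical. The curve parameters are the Bayesian posterior medians quoted earlier, not the fit produced by the standard software, so inverting individual readings with them would not reproduce the per-well estimates in Figure 19.3 exactly; the averaging step uses the reported estimates directly.

```python
def curve(x, beta):
    """Calibration curve (19.1) for x > 0."""
    b1, b2, b3, b4 = beta
    return b1 + b2 / (1.0 + (x / b3) ** (-b4))


def invert_curve(y, beta):
    """Solve y = curve(x, beta) for x; return None outside the curve's range."""
    b1, b2, b3, b4 = beta
    if not (b1 < y < b1 + b2):
        return None  # reading at or below baseline, or above saturation
    return b3 * (b2 / (y - b1) - 1.0) ** (-1.0 / b4)


def average_estimate(well_estimates, dilutions):
    """Divide each well's estimated concentration by its dilution, then average."""
    scaled = [est / d for est, d in zip(well_estimates, dilutions)]
    return sum(scaled) / len(scaled)


beta = (14.7, 99.7, 0.054, 1.34)  # illustrative: posterior medians from the text
print(invert_curve(curve(0.02, beta), beta))  # round trip recovers the concentration

# Unknown 9: the four estimates above detection limit reported in Figure 19.3.
est = average_estimate([0.040, 0.031, 0.005, 0.005], [1, 1, 1 / 3, 1 / 3])
print(round(est, 3))
```

The `None` branch shows where information is discarded: any reading at or below the fitted baseline has no inverse, even though, as noted above, such readings can still decline consistently with dilution.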

The estimates from the Bayesian analysis are generally similar to those of the standard 
method but with higher accuracy. An advantage of the Bayesian approach is that it yields 
a concentration estimate for all unknowns, even Unknown 8 for which there is no stan- 
dard estimate because all its measurements are ‘below detection limit.’ We also created 
concentration estimates for each unknown based on each of the two halves of the data (in 
the setup of Figure 19.1, using only the top four wells or the bottom four wells for each 
unknown). For the standard and Bayesian approaches the two estimates are similar, but 
the reliability (that is, the agreement between the two estimates) is much stronger for the 
Bayesian estimate. We would not want to make too strong a claim based on data from a 
single plate. We performed a more thorough study (not shown here) to compare the old 
and new methods under a range of experimental conditions. 


Figure 19.6 Estimated fraction of PERC metabolized, as a function of steady-state concentration 
in inhaled air, for 10 hypothetical individuals randomly selected from the estimated population of 
young adult white males. 


19.2 Example: population toxicokinetics 


In this section we discuss a much more complicated nonlinear model used in toxicokinetics 
(the study of the flow and metabolism of toxins in the body) for the ultimate goal of 
assessing the risk in the general population associated with a particular air pollutant. This 
model is hierarchical and multivariate, with a vector of parameters to be estimated on each 
of several experimental subjects. The prior distributions for this model are informative and 
hierarchical, with separate variance components corresponding to uncertainty about the 
average level in the population and variation around that average. 


Background 


Perchloroethylene (PERC) is one of many industrial products that cause cancer in animals 
and is believed to do so in humans as well. PERC is breathed in, and the general under- 
standing is that it is metabolized in the liver and that its metabolites are carcinogenic. 
Thus, a relevant ‘dose’ to study when calibrating the effects of PERC is the amount me- 
tabolized in the liver. Not all the PERC that a person breathes will be metabolized. We 
focus here on estimating the fraction metabolized as a function of the concentration of the 
compound in the breathed air, and how this function varies across the population. To give 
an idea of our inferential goals, we skip ahead to show some output from our analysis. Fig- 
ure 19.6 displays the estimated fraction of inhaled PERC that is metabolized as a function 
of concentration in air, for 10 randomly selected draws from the estimated population of 
young adult white males (the group on which we had data). The shape of the curve is 
discussed below after the statistical modeling is described. 

It is not possible to estimate curves of this type with reasonable confidence using simple 
procedures such as direct measurement of metabolite concentrations (difficult even at high 
exposures and not feasible at low exposures) or extrapolation from animal results. Instead 
a mathematical model of the flow of the toxin through the bloodstream and body organs, 
and of its metabolism in the liver is used to estimate the fraction of PERC metabolized. 

A sample of the experimental data we used to fit the model of toxin flow is shown in 
Figure 19.7. Each of six volunteers was exposed to PERC at a high level for four hours 
(believed long enough for the PERC concentrations in most of their bodily organs to come 
to equilibrium) and then PERC concentrations in exhaled air and in blood were measured 
over a period of a week (168 hours). In addition, the data on each person were repeated at 
a second exposure level (data not shown). 


Figure 19.7 Concentration of PERC (in milligrams per liter) in exhaled air and in blood, over time, 
for one of two replications in each of six experimental subjects. The measurements are displayed 
on logarithmic scales. 


Toxicokinetic model 


Our analysis is based on a standard physiological model, according to which the toxin enters 
and leaves through the breath, is distributed by blood flow to four ‘compartments’—well- 
perfused tissues, poorly perfused tissues, fat, and the liver—and is metabolized in the liver. 
This model has a long history in toxicology modeling and has been shown to reproduce 
most features of such data. A simpler one- or two-compartment model might be easier to 
estimate, but such models provide a poor fit to our data and, more importantly, do not 
have the complexity to accurately fit varying exposure conditions. 

We briefly describe the nature of the toxicokinetic model, omitting details not needed 
for understanding our analysis. Given a known concentration of the compound in the 
air, the concentration of the compound in each compartment over time is governed by a 
first-order differential equation, with parameters for the volume, blood flow, and partition 
coefficient (equilibrium concentration relative to the blood) of each compartment. The liver 
compartment where metabolism occurs has a slightly different equation than the other com- 
partments and is governed by the parameters mentioned above and a couple of additional 
parameters. The four differential equations give rise to a total of 15 parameters for each 
individual. We use the notation $\theta_k = (\theta_{k1},\ldots,\theta_{kL})$ for the vector of $L = 15$ parameters 
associated with person $k$. 

Given the values of the physiological parameters and initial exposure conditions, the 
differential equations can be solved using specialized numerical algorithms to obtain con- 
centrations of the compound in each compartment and the rate of metabolism as a function 
of time. We can combine predictions about the PERC concentration in exhaled air and 
blood based on the numerical solution of the differential equations with our observed con- 
centration measurements to estimate the model parameters for each individual. 
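
To illustrate what solving such an equation numerically involves, here is a deliberately simplified sketch: a single non-metabolizing compartment fed at a constant concentration. It is not the four-compartment PERC model. The equation form $V\,dC/dt = Q\,(C_{\rm in} - C/P)$, the variable names, and all parameter values are assumptions made for illustration. The numerical solution is compared with the closed-form solution that exists in this simple case.

```python
import math


def rk4(f, y0, t0, t1, steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with the classical Runge-Kutta method."""
    h = (t1 - t0) / steps
    t, y = t0, y0
    for _ in range(steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h * k1 / 2)
        k3 = f(t + h / 2, y + h * k2 / 2)
        k4 = f(t + h, y + h * k3)
        y += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return y


# Hypothetical values: volume V, blood flow Q, partition coefficient P, input concentration C_IN.
V, Q, P, C_IN = 10.0, 2.0, 5.0, 1.0


def dC_dt(t, C):
    """Assumed first-order compartment equation: V dC/dt = Q (C_IN - C / P)."""
    return Q / V * (C_IN - C / P)


numeric = rk4(dC_dt, 0.0, 0.0, 4.0, 400)
exact = P * C_IN * (1.0 - math.exp(-Q * 4.0 / (V * P)))
print(numeric, exact)
```

In the full model there is no closed form to check against, because the compartments are coupled and the liver equation includes metabolism, which is why specialized numerical algorithms are needed.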


Difficulties in estimation and the role of prior information 


A characteristic difficulty of estimating models in toxicology and pharmacology is that 
they predict patterns of concentration over time that are close to mixtures of declining 
exponential functions, with the amplitudes and decay times of the different components 
corresponding to functions of the model parameters. It is well known that the estimation 
of the decay times of a mixture of exponentials is an ill-conditioned problem (see Exercise 
19.3); that is, the parameters in such a model are hard to estimate simultaneously. 
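
The ill-conditioning can be seen in a small numerical sketch. The coefficients below are illustrative values, not taken from the PERC analysis: one curve is a sum of three exponentials and the other a sum of two, with quite different decay rates, yet the curves nearly coincide.

```python
import math


def sum_of_exponentials(x, terms):
    """Evaluate the sum of A * exp(-rate * x) over (A, rate) pairs."""
    return sum(A * math.exp(-rate * x) for A, rate in terms)


# Illustrative coefficients only.
three_terms = [(0.0951, 1.0), (0.8607, 3.0), (1.5576, 5.0)]
two_terms = [(0.305, 1.58), (2.202, 4.45)]

grid = [i / 100 for i in range(0, 301)]  # x from 0 to 3
max_gap = max(
    abs(sum_of_exponentials(x, three_terms) - sum_of_exponentials(x, two_terms))
    for x in grid
)
print(max_gap)
```

On this grid the two curves differ by less than 0.01 while the function itself falls from about 2.5 to near zero, so data with even modest measurement error could not distinguish the two sets of decay rates.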
Solving the problem of estimating metabolism from indirect data is facilitated by using 
a physiological pharmacokinetic model; that is, one in which the individual and population 
parameters have direct physical interpretations (for example, blood flow through the fatty 
tissue, or tissue/blood partition coefficients). These models permit the identification of 
many of their parameter values through prior (for example, published) physiological data. 
Since the parameters of these models are essentially impossible to estimate from the data 
alone, it is crucial that they have physical meaning and can be assigned informative prior 
distributions. 


Measurement model 


We first describe how the toxicological model is used as a component of the nonlinear 
model for blood and air concentration measurements. Following that is a description of 
the population model which allows us to infer the distribution of population characteristics 
related to PERC metabolism. The data are a series of measurements of exhaled air and 
blood concentrations taken on each of the six people in the study. We label these data as 
$y_{jkmt}$, with $j$ indexing replications ($j = 1, 2$ for the two exposure levels in our data), $k$ 
indexing individuals, $m$ indexing measurements ($m = 1$ for blood concentration and $m = 2$ 
for air concentration), and $t$ indexing time. The expected values of the exhaled air and blood 
concentrations are nonlinear functions $g_m(\theta_k, E_j, t)$ of the individual’s parameters $\theta_k$, the 
exposure level $E_j$, and time $t$. The functions $g_m(\cdot)$ are our shorthand notation for the 
solution of the system of differential equations relating the physiological parameters to the 
expected concentration. Given the input conditions for replication $j$ (that is, $E_j$) and the 
parameters $\theta_k$ (as well as a number of additional quantities measured on each individual 
but suppressed in our notation here), one can numerically evaluate the pharmacokinetic 
differential equations over time and compute $g_1$ and $g_2$ for all values at which measurements 
have been taken, thus obtaining the expected values of all the measurements. 

The concentrations actually observed in expired air and blood are also affected by
measurement errors, which are assumed, as usual, to be independent and lognormally
distributed, with a mean of zero and a standard deviation of $\sigma_m$ (on the log scale) for m = 1, 2.
These measurement error distributions also implicitly account for errors in the model. We
allow the two components of $\sigma$ to differ, because the measurements in blood and exhaled air
have different experimental protocols and therefore are likely to have different precisions.
We have no particular reason to believe that modeling or measurement errors for air and
blood measurements will be correlated, so we assign independent uniform prior distributions
to $\log\sigma_1$ and $\log\sigma_2$. (After fitting the model, we examined the residuals and did not find
any evidence of high correlations.)
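
As a small illustration of the measurement model, the following sketch computes the log-likelihood contribution of one measurement type under the assumption that log y is normal with mean log g and standard deviation sigma. The function name and argument layout are our own; the Jacobian term for working with y rather than log y is omitted because it does not depend on the parameters.

```python
import math


def log_lik(y, g, sigma):
    """Sum of normal log-densities of log(y) with mean log(g) and sd sigma."""
    total = 0.0
    for yi, gi in zip(y, g):
        z = (math.log(yi) - math.log(gi)) / sigma
        total += -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)
    return total
```

Blood and air measurements would each get their own call with their own sigma, and the two results would be added.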


Population model for parameters 


One of the goals of this project is to estimate the distribution of the individual pharmacoki- 
netic parameters and of predicted values such as fraction metabolized (which are complex 
functions of the individual parameters), in the general population. In an experiment with 
K individuals, we set up a hierarchical model on the K vectors of parameters to allow us 
to draw inferences about the general population from which the individuals are drawn. 

A skewed, lognormal-like distribution is generally observed for biological parameters. 
Most, if not all, of the biological parameters also have physiological bounds. Based on 
this information, the individual pharmacokinetic parameters, after log-transformation and
appropriate scaling (see below), are modeled with normal distributions having population
mean $\mu_l$ and standard deviation $\tau_l$, truncated at ±3 standard deviations from the mean, where k indexes individuals and
l = 1,..., L indexes the pharmacokinetic parameters in the model. The distributions are 
truncated to restrict the model parameters to scientifically reasonable values. In addition, 
the truncations serve a useful role when we monitor the simulations of the parameters from 
their posterior distribution: if the simulations for a parameter are stuck near truncation 
points, this indicates that the data and the pharmacokinetic model strongly contradict the 
prior distribution, and some part of the model should be re-examined. 

The vector of parameters for individual k is $\theta_k = (\theta_{k1}, \ldots, \theta_{kL})$, with L = 15. Some of
the parameters are constrained by definition: in the model under discussion, the parameters
$\theta_{k2}, \theta_{k3}, \theta_{k4}, \theta_{k5}$ represent the fractions of blood flow to each compartment, and so are
constrained to sum to 1. Also, the parameters $\theta_{k6}, \theta_{k7}, \theta_{k8}$ correspond to the scaling coefficients
of the organ volumes, and are constrained to sum to 0.873 (the standard fraction of lean
body mass not including bones), for each individual. Of these three parameters, $\theta_{k8}$, the
volume of the liver, is much smaller than the others and there is considerable prior
information about this quantity. For the purposes of modeling and computation, we transform
the model in terms of a new set of parameters $\psi_{kl}$ defined as follows:


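$$
\theta_{kl} =
\begin{cases}
\dfrac{e^{\psi_{kl}}}{e^{\psi_{k2}} + e^{\psi_{k3}} + e^{\psi_{k4}} + e^{\psi_{k5}}}, & \text{for } l = 2, 3, 4, 5 \\[2ex]
\dfrac{(0.873 - e^{\psi_{k8}})\, e^{\psi_{kl}}}{e^{\psi_{k6}} + e^{\psi_{k7}}}, & \text{for } l = 6, 7 \\[2ex]
e^{\psi_{kl}}, & \text{for } l = 1 \text{ and } 8, \ldots, 15.
\end{cases}
\qquad (19.5)
$$
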
The parameters $\psi_{k2}, \ldots, \psi_{k5}$ and $\psi_{k6}, \psi_{k7}$ are not identified (for example, adding any
constant to $\psi_{k2}, \ldots, \psi_{k5}$ does not alter the values of the physiological parameters, $\theta_{k2}, \ldots, \theta_{k5}$),
but they are assigned proper prior distributions, so we can formally manipulate their
posterior distributions.
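
The transformation (19.5) can be sketched in a few lines. The function below is an illustration under our own conventions: the list index i holds the parameter with l = i + 1, and the constant name `LEAN_FRACTION` is ours. It shows that the blood-flow fractions sum to 1, that the three organ-volume coefficients sum to 0.873, and that shifting the flow components of the transformed vector by a constant leaves the physiological parameters unchanged.

```python
import math

LEAN_FRACTION = 0.873  # fraction of lean body mass not including bones


def psi_to_theta(psi):
    """Map one person's 15 transformed parameters psi to theta, as in (19.5)."""
    e = [math.exp(p) for p in psi]
    theta = list(e)  # l = 1 and l = 8,...,15: theta = exp(psi)
    flow_total = sum(e[1:5])
    for i in range(1, 5):  # l = 2,...,5: blood-flow fractions
        theta[i] = e[i] / flow_total
    volume_total = e[5] + e[6]
    for i in (5, 6):  # l = 6, 7: organ volumes sharing what the liver leaves
        theta[i] = (LEAN_FRACTION - e[7]) * e[i] / volume_total
    return theta
```

Note that the sketch does not guard against exp(psi) for the liver exceeding 0.873; in the model this is handled by the informative, truncated prior distribution.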

Each set of $\psi_{kl}$ parameters is assumed to follow a normal distribution with mean $\mu_l$ and
standard deviation $\tau_l$, truncated at three standard deviations. Modeling on the scale of $\psi$
respects the constraints on $\theta$ while retaining the truncated lognormal distributions for the
unconstrained components. All computations are performed with the $\psi$’s, which are then
transformed back to $\theta$’s at the end to interpret the results on the natural scales.
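
A draw from a normal distribution truncated at three standard deviations can be obtained by simple rejection, as in the sketch below. This is one possible approach, not necessarily the one used in the analysis; it is efficient here because only about 0.3% of untruncated draws fall outside the bounds.

```python
import random


def sample_truncated_normal(mu, tau, rng, n_sd=3.0):
    """Draw from N(mu, tau^2) restricted to within n_sd sd of the mean."""
    while True:
        x = rng.gauss(mu, tau)
        if abs(x - mu) <= n_sd * tau:
            return x
```

For example, `sample_truncated_normal(2.0, 0.5, random.Random(1))` always returns a value between 0.5 and 3.5.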

In the model, the population distributions for the L = 15 physiological parameters are
assumed independent. After fitting the model, we checked the 15 · 14/2 = 105 correlations among
the parameter pairs across the six people and found no evidence that they differed from
zero. If we did find large correlations, we would either want to add correlations to the
model (as described in Section 15.4) or reparameterize to make the correlations smaller.
In fact, the parameters in our model were already transformed to reduce correlations (for
example, by working with proportional rather than absolute blood flows and organ volumes,
as described above).


Prior information 


In order to fit the population model, we assign prior distributions to the means and
variances, $\mu_l$ and $\tau_l^2$, of the L (transformed) physiological parameters. We specify a prior
distribution for each $\mu_l$ (normal with parameters $M_l$ and $S_l^2$ based on substantive knowledge)
and $\tau_l^2$ (inverse-$\chi^2$, centered at an estimate $\tau_{0l}^2$ of the true population variance and with a
low number of degrees of freedom $\nu_l$, typically set to 2, to indicate large uncertainties).

The hyperparameters $M_l$, $S_l$, and $\tau_{0l}$ are based on estimates available in the biological
literature. Sources include studies on humans and allometric scaling from animal
measurements. We set independent prior distributions for the $\mu_l$’s and $\tau_l$’s because our prior
information about the parameters is essentially independent, to the best of our knowledge, 
given the parameterization and scaling used (for example, blood flows as a proportion of 
the total rather than on absolute scales). In setting uncertainties, we try to be weakly 
informative and set the prior variances higher rather than lower when there is ambiguity in 
the biological literature. 

The model has 15 parameters for each of the six people in a toxicokinetic experiment; 
$\theta_{jk}$ is the value of the kth parameter for person j, with j = 1,...,6 and k = 1,...,15.
Prior information about the parameters was available in the biological literature. For each
parameter, it was important to distinguish between two sources of variation: prior
uncertainty about the value of the parameter and population variation. This was represented by
a lognormal model for each parameter, $\log\theta_{jk} \sim \mathrm{N}(\mu_k, \tau_k^2)$, and by assigning independent
prior distributions to the population geometric mean and standard deviation, $\mu_k$ and $\tau_k$:
$\mu_k \sim \mathrm{N}(M_k, S_k^2)$ and $\tau_k^2 \sim \text{Inv-}\chi^2(\nu, \tau_{k0}^2)$. The prior distribution on $\mu_k$, especially through
the standard deviation $S_k$, describes our uncertainty about typical values of the parameter
in the population. The prior distribution on $\tau_k$ tells us about the population variation for the
parameter. Because prior knowledge about population variation was imprecise, the degrees
of freedom in the prior distributions for $\tau_k^2$ were set to the low value of $\nu = 2$.

Some parameters are better understood than others. For example, the weight of the 
liver, when expressed as a fraction of lean body weight, was estimated to have a population 
geometric mean of 0.033, with both the uncertainty on the population average and the 
heterogeneity in the population estimated to be of the order of 10% to 20%. The prior 
parameters were set to $M_k = \log(0.033)$, $S_k = \log(1.1)$, and $\tau_{k0} = \log(1.1)$. In contrast,
the Michaelis-Menten coefficient (a particular parameter in the pharmacokinetic model) was 
poorly understood: its population geometric mean was estimated at 0.7, but with a possible 
uncertainty of up to a factor of 100 above or below. Despite the large uncertainty in the 
magnitude of this parameter, however, it was believed to vary by no more than a factor of 4 
relative to the population mean, among individuals in the population. The prior parameters 
were set to $M_k = \log(0.7)$, $S_k = \log(10)$, and $\tau_{k0} = \log(2)$. The hierarchical model provides
an essential framework for expressing the two sources of variation (or uncertainty) and 
combining them in the analysis. 
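
One way to read these hyperparameter settings, assuming that a stated range of uncertainty is interpreted as roughly two standard deviations on the log scale (this interpretation is ours and is not spelled out in the text), is to exponentiate twice the log-scale standard deviation and obtain a multiplicative factor:

$$
e^{2\log(10)} = 100, \qquad e^{2\log(2)} = 4, \qquad e^{\log(1.1)} = 1.1, \qquad e^{2\log(1.1)} = 1.21 .
$$

Under this reading, the prior standard deviation of log(10) for the Michaelis-Menten coefficient corresponds to a factor of 100 above or below the geometric mean of 0.7, the population standard deviation of log(2) corresponds to individuals varying by up to a factor of 4, and the value log(1.1) used for the liver corresponds to variation of about 10% to 20%.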


Joint posterior distribution for the hierarchical model 


For Bayesian inference, we obtain the posterior distribution (up to a multiplicative constant) 
for all the parameters of interest, given the data and the prior information, by multiplying 
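all the factors in the hierarchical model: the data distribution, $p(y|\psi, E, t, \sigma)$, the population
model, $p(\psi|\mu, \tau)$, and the prior distribution, $p(\mu, \tau|M, S, \tau_0)$,

$$
\begin{aligned}
p(\psi, \mu, \tau^2, \sigma^2 \mid y, E, t, \phi, M, S, \tau_0^2, \nu)
&\propto p(y \mid \psi, \phi, E, t, \sigma^2)\, p(\psi \mid \mu, \tau^2)\, p(\mu, \tau^2 \mid M, S, \tau_0^2)\, p(\sigma^2) \\
&\propto \prod_{j=1}^{J} \prod_{k=1}^{K} \prod_{m=1}^{2} \prod_{t} \mathrm{N}\!\left(\log y_{jkmt} \mid \log g_m(\theta_k, E_j, t), \sigma_m^2\right) \sigma_1^{-2} \sigma_2^{-2} \\
&\quad \times \left( \prod_{k=1}^{K} \prod_{l=1}^{L} \mathrm{N}_{\rm trunc}(\psi_{kl} \mid \mu_l, \tau_l^2) \right)
\left( \prod_{l=1}^{L} \mathrm{N}(\mu_l \mid M_l, S_l^2)\, \text{Inv-}\chi^2(\tau_l^2 \mid \nu_l, \tau_{0l}^2) \right)
\qquad (19.6)
\end{aligned}
$$
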


where $\psi$ is the set of vectors of individual-level parameters, $\mu$ and $\tau$ are the vectors of
population means and standard deviations, $\sigma^2$ is the pair of measurement variances, y is the
vector of concentration measurements, E and t are the exposure concentrations and times,
and M, S, $\tau_0$, and $\nu$ are the hyperparameters. We use the notation $\mathrm{N}_{\rm trunc}$ for the normal
distribution truncated at the specified number of standard deviations from the mean. The
indexes j, k, l, m, and t refer to replication, person, parameter, type of measurement (blood
or air), and time of measurement. To compute (19.6) as a function of the parameters, data,
and experimental conditions, the functions $g_m$ must be computed numerically over the range
of time corresponding to the experimental measurements.


Computation 


Our goals are first to fit a pharmacokinetic model to experimental data, and then to use the 
model to perform inferences about quantities of interest, such as the population distribution
of the fraction of the compound metabolized at a given dose. We attain these goals using
random draws of the parameters from the posterior distribution, $p(\psi, \mu, \tau, \sigma | y, E, t, M, S, \tau_0, \nu)$.
We use a Gibbs sampling approach, iteratively updating the parameters in the following
sequence: $\sigma, \tau, \mu, \psi_1, \ldots, \psi_K$. Each of these is actually a vector parameter. The conditional
distributions for the components of $\sigma^2$, $\tau^2$, and $\mu$ are inverse-$\chi^2$, inverse-$\chi^2$, and normal.
The conditional distributions for the parameters $\psi$ have no closed form, so we sample from
them using steps of the Metropolis algorithm, which requires only the ability to compute 
the posterior density up to a multiplicative constant, as in (19.6). 

Our implementation of the Metropolis algorithm alters the parameters one person at a
time (thus, K jumps in each iteration, with each jump affecting an L-dimensional vector
$\psi_k$). The parameter vectors are altered using a normal proposal distribution centered at the
current value, with covariance matrix proportional to one obtained from some initial runs
and scaled so that the acceptance rate is approximately 0.23. Updating the parameters
of one person at a time means that the only factors of the posterior density that need
to be computed for the Metropolis step are those corresponding to that person. This is
an important concern, because evaluating the functions $g_m$ to obtain expected values of
measurements is the costliest part of the computation. An alternative approach would be
to alter a single component $\psi_{kl}$ at a time; this would require KL jumps in each iteration.
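
The person-at-a-time Metropolis update can be sketched as follows. This is a simplified illustration: the proposal here is a spherical normal with a single hypothetical `scale` rather than the covariance matrix estimated from initial runs, and `toy_log_post` is a stand-in target (independent standard normals) in place of the person-specific factors of (19.6).

```python
import math
import random


def metropolis_sweep(psi, log_post_person, scale, rng):
    """One sweep over people: a joint normal jump for each person's vector.

    psi is a list of K lists; log_post_person(k, vector) returns the log of
    the factors of the posterior density that involve person k only.
    Returns the number of accepted jumps (between 0 and K).
    """
    accepted = 0
    for k, current in enumerate(psi):
        proposal = [value + rng.gauss(0.0, scale) for value in current]
        log_ratio = log_post_person(k, proposal) - log_post_person(k, current)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            psi[k] = proposal
            accepted += 1
    return accepted


def toy_log_post(k, vector):
    """Stand-in target: independent standard normal components."""
    return -0.5 * sum(v * v for v in vector)
```

Because only `log_post_person(k, ...)` is evaluated for each jump, the expensive differential-equation solution is recomputed for one person at a time, which is the point made in the paragraph above.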

We performed five independent simulation runs, each of 50,000 iterations, with starting 
points obtained by sampling each $\psi_{kl}$ at random from its prior distribution and then setting
the population averages $\mu_l$ at their prior means, $M_l$. We then began the simulations by
drawing $\sigma$ and $\tau$. Because of storage limitations, we saved only every tenth iteration of the
parameter vector. We monitored the convergence of the simulations by comparing within-
and between-simulation variances, as described in Section 11.4. In practice, the model was
gradually implemented and debugged over a period of months, and one reason for our trust 
in the results is their general consistency with earlier simulations of different variants of the 
model. 
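
The comparison of within- and between-simulation variances can be sketched for a single scalar quantity as below. This is a basic version of the potential scale reduction factor; it assumes all chains have the same length and omits refinements such as splitting each chain in half.

```python
import math


def potential_scale_reduction(chains):
    """Compare between- and within-chain variance for one scalar quantity.

    chains is a list of m sequences, each of length n. Values near 1
    suggest the chains have mixed; larger values suggest they have not.
    """
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    between = n / (m - 1) * sum((mj - grand) ** 2 for mj in means)
    within = sum(
        sum((x - mj) ** 2 for x in c) / (n - 1) for c, mj in zip(chains, means)
    ) / m
    var_plus = (n - 1) / n * within + between / n
    return math.sqrt(var_plus / within)
```

In practice the function would be applied to each parameter and each quantity of interest, using the saved iterations from the independent runs.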


Inference for quantities of interest 


First off, we examined the inferences about the model parameters and their population vari- 
ability, and checked that these made sense and were consistent with the prior distribution. 
After this, the main quantities of interest were the fraction of PERC metabolized under 
different exposure scenarios, as computed by evaluating the differential equation model nu- 
merically under the appropriate input conditions. For each individual k, we can compute 
the fraction metabolized for each simulated parameter vector $\psi_k$; using the set of
simulations yields a distribution of the fraction metabolized for that individual. The variance in
the distribution for each individual is due to uncertainty in the posterior distribution of the
physiological parameters, $\psi_k$.

Figure 19.8 shows the posterior distributions of the PERC fraction metabolized, for each 
person in the experiment, at a high exposure level of 50 parts per million (ppm) and a low 
level of 0.001 ppm. The six people were not actually exposed to these levels; the inferences 
were obtained by running the differential equation model with these two hypothesized input 
conditions along with the estimated parameters for each person. We selected these two 
levels to illustrate the inferences from the model; the high level corresponds to occupational 
exposures and the low level to environmental exposures of PERC. We can and did consider 
other exposure scenarios too. The figure shows the correlation between the high-dose and 
the low-dose estimates of the fraction metabolized in the six people. Large variations exist 
between individuals; for example, a factor of two difference is seen between subjects A and 
E. 

Figure 19.8 Posterior inferences for the quantities of interest (the fraction metabolized at high and
low exposures) for each of the six subjects in the PERC experiment. The scatter within each plot
represents posterior uncertainty about each person’s metabolism. The variation among these six
persons represents variation in the population studied of young adult white males.

Similar simulations were performed for an additional person from the population (that
is, a person exchangeable with the subjects in the study) by simulating random vectors of
the physiological parameters from their population distributions. The variance in the
resulting population distribution of fraction of PERC metabolized includes posterior uncertainty
in the parameter estimates and real inter-individual variation in the population. Interval
estimates for the fraction metabolized can be obtained as percentiles of the simulated
distributions. At high exposure (50 ppm) the 95% interval for the fraction metabolized in the
population is [0.5%, 4.1%]; at low exposure (0.001 ppm) it is [15%, 58%].
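
The two steps described above, simulating a new person from the population distribution and summarizing by percentiles, can be sketched for a single scalar parameter. The function names are ours, `quantity` stands in for the (expensive) computation of the fraction metabolized, and the interval uses simple order statistics rather than interpolated quantiles.

```python
import math
import random


def central_interval(draws, lower=0.025, upper=0.975):
    """Approximate central interval from order statistics of the draws."""
    ordered = sorted(draws)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


def simulate_new_person(population_draws, quantity, rng, n_sd=3.0):
    """For each posterior draw (mu, tau), draw a new person's parameter
    from the truncated normal population model and evaluate quantity."""
    results = []
    for mu, tau in population_draws:
        while True:
            psi = rng.gauss(mu, tau)
            if abs(psi - mu) <= n_sd * tau:
                break
        results.append(quantity(psi))
    return results
```

Because each simulated person uses a different posterior draw of the population mean and standard deviation, the resulting interval reflects both posterior uncertainty and variation among individuals.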

We also studied the fraction of PERC metabolized in one day (after three weeks of 
inhalation exposure) as a function of exposure level. This is the relation we showed in 
introducing the PERC example; it is shown in Figure 19.6. At low exposures the fraction 
metabolized remains constant, since metabolism is linear. Saturation starts occurring above 
1 ppm and is about complete at 10 ppm. At higher levels the fraction metabolized decreases 
linearly with exposure since the quantity metabolized per unit time is at its maximum. 
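
The two regimes can be seen from a generic saturable (Michaelis-Menten) rate law. The symbols below are assumptions for illustration: $c$ is the concentration at the site of metabolism, $V_{\max}$ the maximum metabolic rate, and $K_m$ the Michaelis-Menten coefficient.

$$
\text{rate}(c) = \frac{V_{\max}\, c}{K_m + c} \approx
\begin{cases}
(V_{\max}/K_m)\, c, & c \ll K_m \quad \text{(linear metabolism)} \\
V_{\max}, & c \gg K_m \quad \text{(saturation)}.
\end{cases}
$$

If intake is roughly proportional to exposure, the fraction metabolized is the rate divided by intake. It is then approximately constant at low exposure and, once the rate is capped at $V_{\max}$, approximately inversely proportional to exposure, which appears as a straight line with slope $-1$ when both axes are logarithmic.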

When interpreting these results, one must remember that they are based on a single 
experiment. This study appears to be one of the best available; however, it included only 
six people from a homogeneous population, measured at only two exposure conditions. 
Much of the uncertainty associated with the results is due to these experimental limitations. 
Posterior uncertainty about the parameters for the people in the study could be reduced 
by collecting and analyzing additional data on these individuals. To learn more about the 
population we would need additional individuals. Population variability, which in this study 
is approximately as large as posterior uncertainty, might increase if a more heterogeneous 
group of subjects were included. 


Evaluating the fit of the model 


In addition to their role in inference given the model, the posterior simulations can be used 
in several ways to check the fit of the model. 
Most directly, we can examine the errors of measurement and modeling by comparing
observed data, $y_{jkmt}$, to their expectations, $g_m(\theta_k, E_j, t)$, for all the measurements, based
on the posterior simulations of $\theta$. Figure 19.9 shows a scatterplot of the relative prediction
errors of all our observed data (that is, observed data divided by their predictions from
the model) versus the model predictions. (Since the analysis was Bayesian, we have many
simulation draws of the parameter vector, each of which yields slightly different predicted
data. Figure 19.9, for simplicity, shows the predictions from just one of these simulation
draws, selected at random.) The magnitude of these errors is reasonably low compared to
other fits of this kind of data.

Figure 19.9 Observed PERC concentrations (for all individuals in the study) divided by expected
concentrations, plotted vs. expected concentrations. The x and y-axes are on different (logarithmic)
scales: observations vary by a factor of 10,000, but the relative errors are mostly between 0.8 and
1.25. Because the expected concentrations are computed based on a random draw of the parameters
from their posterior distribution, the figure shows the actual misfit estimated by the model, without
the need to adjust for fitting.

Figure 19.10 External validation data and 95% predictive intervals from the model fit to the PERC
data. The model predictions fit the data reasonably well but not in the first 15 minutes of exposure, a
problem we attribute to the fact that the model assumes that all compartments are in instantaneous
equilibrium, whereas this actually takes about 15 minutes to approximately hold.


We can also check the model by comparing its predictions to additional data not used in 
the original fit. We use the results of a second inhalation experiment on human volunteers, 
in which 6 people were exposed to constant levels of PERC ranging from 0.5 to 9 ppm
(much lower than the concentrations in our study) and the concentrations in exhaled air
and blood were measured during exposure for up to 50 minutes (a much shorter time period
than in our study). Since these are new individuals, we created posterior simulations of the
blood/exhaled air concentration ratio (this is the quantity studied by the investigators in
the second experiment) by using posterior draws from the population distribution $p(\theta|\mu, \tau)$
as the parameters in the nonlinear model. Figure 19.10 presents the observed data and the
model prediction (with 95% and 99% simulation bounds). The model fit is good overall, 
even though exposure levels were 5 to 100 times lower than those used in our data. However, 
short-term kinetics (less than 15 minutes after the onset of exposure) are not well described 
by the model, which includes only a simple description of pulmonary exchanges. 


Use of a complex model with an informative prior distribution 


Our analysis has five key features, all of which work in combination: (1) a physiological 
model, (2) a population model, (3) prior information on the population physiological
parameters, (4) experimental data, and (5) Bayesian inference. If any of these five features
is missing, the model will not work: (1) without a physiological model, there is no good
way to obtain prior information on the parameters, (2) without a population model, there 
is not generally enough data to estimate the model independently on each individual, (3 
and 4) the parameters of a multi-compartment physiological model cannot be determined 
accurately by data or prior information alone, and (5) Bayesian inference yields a distri- 
bution of parameters consistent with both prior information and data, if such agreement 
is possible. Because it automatically includes both inferential uncertainty and population 
variability, the hierarchical Bayesian approach yields a posterior distribution that can be 
directly used for an uncertainty analysis of the risk assessment process. 


19.3 Bibliographic note 


Carroll, Ruppert, and Stefanski (1995) is a recent treatment of nonlinear statistical models. 
Giltinan and Davidian (1995) discuss hierarchical nonlinear models. Reilly and Zeringue 
(2004) show how a simple Bayesian nonlinear predator-prey model can outperform classical 
time series models for an animal-abundance example. 


An early example of a serial dilution assay, from Fisher (1922), is discussed by McCullagh 
and Nelder (1989, p. 11). Assays of the form described in Section 19.1 are discussed by 
Racine-Poon, Weihs, and Smith (1991) and Higgins et al. (1998). The analysis described in 
Section 19.1 appears in Gelman, Chew, and Shnaidman (2004). 


The toxicology example is described in Gelman, Bois, and Jiang (1996). Hierarchical 
pharmacokinetic models have a long history; see, for example, Sheiner, Rosenberg, and 
Melmon (1972), Sheiner and Beal (1982), Wakefield (1996), and the discussion of Wakefield, 
Aarons, and Racine-Poon (1999). Other biomedical applications in which Bayesian analysis 
has been used for nonlinear models include magnetic resonance imaging (Genovese, 2001). 


Nonlinear models with large numbers of parameters are a bridge to classical nonpara- 
metric statistical methods and to methods such as neural networks that are popular in 
computer science. Neal (1996a) and Denison et al. (2002) discuss these from a Bayesian 
perspective. Chipman, George, and McCulloch (1998, 2002) give a Bayesian presentation 
of nonlinear regression-tree models. Bayesian discussions of spline models, which can be 
viewed as nonparametric generalizations of linear regression models, include Wahba (1978), 
DiMatteo et al. (2001), and Denison et al. (2002) among others. Zhao (2000) gives a
theoretical discussion of Bayesian nonparametric models.

Distance    Number of    Number of
 (feet)       tries      successes
    2          1443         1346
    3           694          577
    4           455          337
    5           353          208
    6           272          149
    7           256          136
    8           240          111
    9           217           69
   10           200           67
   11           237           75
   12           202           52
   13           192           46
   14           174           54
   15           167           28
   16           201           27
   17           195           31
   18           191           33
   19           147           20
   20           152           24


Table 19.1 Number of attempts and successes of golf putts, by distance from the hole, for a sample
of professional golfers. From Berry (1996).¹


19.4 Exercises 


1. Nonlinear modeling: The file dilution.dat contains data from the dilution assay ex- 
periment described in Section 19.1. 

(a) Use Stan to fit the model described in Section 19.1. 

(b) Fit the same model, but with a hierarchical mixture prior distribution on the $\theta_j$’s
that allows some of the true concentrations $\theta_j$ to be zero. Discuss your
model, its parameters, and your hyperprior distribution for these parameters.

(c) Compare the inferences for the $\theta_j$’s from the two models above.

(d) Construct a dataset (with the same dilutions and the same number of unknowns and 
measurements), for which these two models yield much different inferences. 

2. Nonlinear modeling: Table 19.1 presents data on the success rate of putts by professional 
golfers (see Berry, 1996, and Gelman and Nolan, 2002c). 

(a) Fit a nonlinear model for the probability of success (using the binomial likelihood) as 
a function of distance. Does your fitted model make sense in the range of the data 
and over potential extrapolations? 

(b) Use posterior predictive checks to assess the fit of the model. 

3. Ill-posed systems: Generate n independent observations $y_i$ from the following model:

$$
y_i \sim \mathrm{N}\!\left(A e^{-\alpha_1 x_i} + B e^{-\alpha_2 x_i}, \sigma^2\right),
$$

where the predictors $x_1, \ldots, x_n$ are uniformly distributed
on [0,10], and you have chosen some particular true values for the parameters.


(a) Fit the model using a uniform prior distribution for the logarithms of the four param- 
eters. 


(b) Do simulations with different values of n. How large does n have to be until the 
Bayesian inferences match the true parameters with reasonable accuracy? 


¹ Reprinted with permission of Brooks/Cole, a division of Thomson Learning.

Chapter 20


Basis function models 


Chapter 19 considered nonlinear models with $\mathrm{E}(y|X, \beta) = \mu(X_i, \phi)$, where $\mu$ is a prespecified
parametric nonlinear function of the predictors with unknown parameters $\phi$. In this and the
following chapters we consider models where $\mu$ is also a priori unknown. In later chapters
we consider more flexible approaches that also allow the residual density to be unknown 
and potentially changing with predictors. 


20.1 Splines and weighted sums of basis functions 


To allow the mean to vary nonlinearly with predictors, one can replace $X_i\beta$ with $\mu(X_i)$,
where $\mu(\cdot)$ falls in some class of nonlinear functions. A variety of approaches are available
for modeling this $\mu$, including the use of basis function expansions and Gaussian processes
(discussed in the following chapter).

To illustrate the basis function approach, we start with one-dimensional regression
models in which $\mu(x)$ is modeled as a sum,

$$
\mu(x) = \sum_{h=1}^{H} \beta_h b_h(x),
$$


where $b = \{b_h\}_{h=1}^{H}$ is a prespecified set of basis functions and $\beta = (\beta_1, \ldots, \beta_H)$ is a vector
of basis coefficients. The Taylor series expansion is a familiar example in which the basis
functions are polynomials of increasing degree, with which one can represent a function
as a (possibly) infinite sum of terms. In practice, Taylor series expansions can require a
huge number of terms to model a function well globally and, for statistical applications,
typically have horrible properties near the boundary. By a more appropriate choice of a
finite set of basis functions it should be possible to more accurately model functions that
arise in practice. It has been found useful to use local basis functions which are centered
on different locations and for which each basis function $b_h$ has a centering point $x_h$ so that
$b_h(x)$ diminishes to zero when x is far from $x_h$.
An often-used simple choice is the family of Gaussian radial basis functions,

$$
b_h(x) = \exp\left(-\frac{|x - x_h|^2}{l^2}\right) \qquad (20.1)
$$

where $x_h$ are centers of the basis functions and $l$ is a common width parameter. The number
of basis functions and the width parameter $l$ control the scale at which the model can vary
as a function of x.
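
As a small illustration, the sketch below evaluates Gaussian radial basis functions and builds one row of the corresponding design matrix. It assumes the form exp(-|x - x_h|^2 / l^2) for the basis function; the function names are ours.

```python
import math


def gaussian_rbf(x, center, width):
    """Gaussian radial basis function centered at `center` with width `width`."""
    return math.exp(-abs(x - center) ** 2 / width ** 2)


def rbf_design_row(x, centers, width):
    """Basis function values at x, one entry per center."""
    return [gaussian_rbf(x, c, width) for c in centers]
```

Each basis function equals 1 at its own center and decays smoothly with distance, so a row of the design matrix is dominated by the centers nearest to x.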

Another commonly used family of basis functions is the B-spline, which is a piecewise
continuous function that is defined conditional on some set of knots. Assuming uniform
knot locations $x_{h+k} = x_h + \delta k$, the cubic B-spline basis function is defined as the following
piecewise cubic polynomial:

$$
b_h(x) =
\begin{cases}
\frac{1}{6} u^3 & \text{for } x \in (x_h, x_{h+1}),\ u = (x - x_h)/\delta \\
\frac{1}{6}(1 + 3u + 3u^2 - 3u^3) & \text{for } x \in (x_{h+1}, x_{h+2}),\ u = (x - x_{h+1})/\delta \\
\frac{1}{6}(4 - 6u^2 + 3u^3) & \text{for } x \in (x_{h+2}, x_{h+3}),\ u = (x - x_{h+2})/\delta \\
\frac{1}{6}(1 - 3u + 3u^2 - u^3) & \text{for } x \in (x_{h+3}, x_{h+4}),\ u = (x - x_{h+3})/\delta \\
0 & \text{otherwise.}
\end{cases}
\qquad (20.2)
$$

Figure 20.1 Single Gaussian (solid line) and cubic B-spline (dashed line) basis functions scaled to
have the same width. The X marks the center of the Gaussian basis function, and the circles mark
the location of knots for the cubic B-spline.


Here the width of the basis function is determined by the distance $\delta$ between knots, and the
maximum flexibility of the model is controlled by the number of knots uniformly placed
in the data range. Knot locations can also be set nonuniformly. B-splines have a more
complex definition than Gaussian radial basis functions, but each B-spline basis function
has compact support, so the design matrix of the linear model is sparse, which can be
exploited in computation.
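
The piecewise definition of the uniform cubic B-spline can be sketched directly. In the function below, `knot` plays the role of the first knot of the basis function and `delta` the knot spacing; the argument names are ours.

```python
def cubic_bspline(x, knot, delta):
    """Uniform cubic B-spline basis function whose support is
    [knot, knot + 4 * delta]; zero elsewhere."""
    s = (x - knot) / delta
    if s < 0 or s >= 4:
        return 0.0
    piece = int(s)  # which of the four knot intervals x falls in
    u = s - piece   # position within that interval, in [0, 1)
    if piece == 0:
        return u ** 3 / 6
    if piece == 1:
        return (1 + 3 * u + 3 * u ** 2 - 3 * u ** 3) / 6
    if piece == 2:
        return (4 - 6 * u ** 2 + 3 * u ** 3) / 6
    return (1 - 3 * u + 3 * u ** 2 - u ** 3) / 6
```

The function peaks at 2/3 in the middle of its support, and at any interior point only four neighboring basis functions are nonzero and their values sum to 1. That local support is what makes the design matrix sparse.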

Figure 20.1 shows single Gaussian and cubic B-spline basis functions. Both have smooth
bell shapes. A weighted sum of such shapes (in which weights can be positive, negative,
or zero) can be used to model smooth functions. Although the basis functions look very
similar, Gaussian radial basis functions will produce smoother functions, as they are infinitely
differentiable, while the cubic B-spline has only two continuous derivatives.

Figure 20.2 shows a set of B-spline basis functions and realizations from the model
obtained by sampling random weights $\beta_h$ from a Gaussian prior distribution for these weights.
The number of splines H impacts the flexibility of the resulting model for $\mu(x)$, as one
cannot characterize finer scale features in $\mu(x)$ than the splines chosen. For example, if
there is a spike in $\mu(x)$ that is narrower than the basis functions in Figure 20.2, then that
spike will be oversmoothed.

Conditionally on the selected basis b, the model is linear in the parameters. Hence, we
can simply re-express the model as $y_i = \mu(x_i) + \epsilon_i = w_i\beta + \epsilon_i$, with $w_i = (b_1(x_i), \ldots, b_H(x_i))$.
Because the resulting model is linear in the parameters $\beta$, model fitting can proceed as in
linear regression models. For example, a multivariate normal-inverse-$\chi^2$ prior for $(\beta, \sigma^2)$
is conjugate so that the posterior of $(\beta, \sigma^2)$ given the data $(x_i, y_i)_{i=1}^{n}$ is also multivariate
normal-inverse-$\chi^2$. However, even though the model is linear in the parameters $\beta$, a rich
class of functions can be accurately approximated as a linear combination of basis functions.
It is often useful to center the basis function model on a linear model,

$$
\mu(x) = \beta_1 + \beta_2 x + \sum_{h=3}^{H+2} \beta_h b_h(x).
$$

Figure 20.2 (a) A set of cubic B-splines with equally spaced knots. (b) A set of random draws from
the B-spline prior for $\mu(x)$ based on the basis functions in the left graph, assuming independent
standard normal priors for the basis coefficients.

Figure 20.3 A small dataset of concentration of chloride over time in a biology experiment. Data
points are circles, the linear regression estimate is shown with a dotted line, and the posterior mean
curve using B-splines is the curved solid line.


Example. Chloride concentration 

We illustrate with a small dataset from a biology experiment containing 54
measurements of the concentration of chloride taken over a short time interval; see Figure 20.3,
which shows raw data, a fitted straight line regression, and the posterior mean of the
regression function (that is, $\mathrm{E}(\mu(x)|y)$ as a function of x, averaging over the posterior
distribution of the parameters $\beta$) from a fitted B-spline model. The data are close to
linear but with some notable local deviations. In this case, there are 21 coefficients
to be estimated ($\beta_1, \ldots, \beta_{21}$) but only 54 data points, so it becomes problematic to
estimate all of the basis coefficients without incorporating prior information. There
are a variety of strategies that can be taken to accommodate such data sparsity. One
possibility is to center the nonparametric prior for the curve $\mu(x)$ on a parametric
function, such as a linear model.

Let $\beta|\sigma \sim \mathrm{N}(\beta_0, \sigma^2\lambda^{-1}I_H)$ and $\sigma^2 \sim \text{Inv-gamma}(a_0, b_0)$. This implies that the prior
expectation for the curve at predictor value x is $\mu_0(x) = \mathrm{E}\,\mu(x) = \sum_{h=1}^{H} \beta_{0h} b_h(x)$.
Supposing that $\mu_0(x) = \alpha + \psi x$, so that the prior mean is linear, we can use least
squares to estimate the values of $\beta_0$ producing $\mu_0(x)$ as close as possible to $\alpha + \psi x$; we
find that for H = 21 in this application, $\mu_0(x)$ is indistinguishable from $\alpha + \psi x$ using
this approach. For simplicity, one can plug in the least squares estimates for $\alpha$ and $\psi$.
The resulting posterior mean is


$$
\hat\mu(x) = \mathrm{E}(\mu(x) \mid (x_1, y_1), \ldots, (x_n, y_n)) = (W^T W + \lambda I)^{-1} (W^T y + \lambda \hat\mu_0(x)),
$$


with $\hat\mu_0(x)$ the estimated least squares regression line and $W = (w_1, \ldots, w_n)^T$. The
posterior mean $\hat\mu(x)$ will be shrunk towards the linear regression estimate, addressing
the data sparsity issue while allowing nonparametric deviations from the linear regres- 
sion fit. For a more complete analysis, one can place hyperpriors on q, or can choose 
a smoothing prior which favors similar values for adjacent basis coefficients; first-order 
autoregressive priors are often used leading to Bayesian penalized (P) splines. 
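
The shrinkage toward a linear prior mean can be sketched numerically. The following is a
minimal illustration and not the analysis behind Figure 20.3: the data are simulated, Gaussian
bumps stand in for the B-spline basis, and the names `gaussian_basis`, `posterior_mean_beta`,
and `lam` (for the prior precision multiplier) are our own. We apply the formula at the level of
the coefficient vector, with the least squares coefficients `beta0` as the prior mean, which is
our reading of the expression above.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_basis(x, centers, width):
    """Hypothetical stand-in for a B-spline basis: one Gaussian bump per center."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

# Simulated data: nearly linear with one local deviation (not the chloride data).
n, H = 54, 21
x = np.sort(rng.uniform(0.0, 1.0, n))
y = 1.0 + 2.0 * x + 0.3 * np.exp(-0.5 * ((x - 0.5) / 0.05) ** 2) + rng.normal(0, 0.1, n)

centers = np.linspace(0.0, 1.0, H)
W = gaussian_basis(x, centers, width=0.08)

# Least squares line, then coefficients beta0 whose curve W @ beta0 best matches that line.
A = np.column_stack([np.ones(n), x])
line = A @ np.linalg.lstsq(A, y, rcond=None)[0]
beta0 = np.linalg.lstsq(W, line, rcond=None)[0]

def posterior_mean_beta(lam):
    """Conditional posterior mean of beta: (W'W + lam I)^{-1} (W'y + lam beta0)."""
    return np.linalg.solve(W.T @ W + lam * np.eye(H), W.T @ y + lam * beta0)

fit_weak = W @ posterior_mean_beta(1e-3)    # little shrinkage: follows local deviations
fit_strong = W @ posterior_mean_beta(1e6)   # heavy shrinkage: close to the prior mean curve
```

With a large multiplier the fitted curve is essentially the prior mean curve, and with a small
one it tracks the local deviations; intermediate values trade off the two.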


In applying splines, an important aspect of the specification is the number of knots 
and their locations. In many applications, it works well to choose sufficiently many knots, 
such as H = 21 in the above example, while also carefully choosing the prior for the basis 
coefficients to limit problems with over-fitting and data sparsity. However, several different 
Bayesian approaches are available for accommodating uncertainty in basis function spec- 
ification. The first is to consider a free knot approach with a prior on the number and 
locations of knots in a kernel or spline model, using reversible jump MCMC (see Section 
12.3) for posterior computation. The resulting posterior distribution for $(\mu, \sigma)$ will allow
for uncertainty through model averaging over the posterior for the number and locations 
of knots. Although this approach is conceptually appealing, the computational implemen- 
tation is a major hurdle. In particular, designing efficient reversible jump algorithms can 
be challenging. A second possibility is to relax the variable selection problem by choosing 
priors that do not set the $\beta_h$ coefficients exactly equal to zero but instead shrink many
of the coefficients to near-zero values, while having heavy tails to avoid over-shrinking the 
coefficients for the important basis functions. A shrinkage prior that is concentrated near 
zero with heavy tails can be thought to provide a continuous analogue to variable selection 
priors, with the shrinkage priors having conceptual and computational advantages by not 
having to jump discontinuously between zero and nonzero values. 

In this chapter, we first describe the variable selection approach, including basic details 
for how to proceed with posterior computation and inferences. We will then outline meth- 
ods for shrinkage, which have some practical advantages over the formal variable selection 
approach. Extensions to multivariate regression with p > 1 will be described, and we will 
provide an introduction to the use of Gaussian process priors as an alternative to explicit 
basis representations. 


20.2 Basis selection and shrinkage of coefficients 


Focusing on the nonparametric regression model with Gaussian residuals and letting
$b = \{b_h\}_{h=1}^H$ denote a prespecified collection of potential basis functions, we have


$$y_i \sim N(w_i\beta, \sigma^2), \quad w_i = (b_1(x_i), \ldots, b_H(x_i)).$$


In practice, there is typically uncertainty in which basis functions are really needed. To
allow basis functions to be excluded from the model using Bayesian variable selection, we
introduce a model index $\gamma = (\gamma_1, \ldots, \gamma_H) \in \Gamma$, with $\gamma_h = 1$ denoting that basis function
$b_h$ should be included and $\gamma_h = 0$ otherwise. Here, $\Gamma$ is a model space containing the $2^H$
possibilities for $\gamma$ ranging from exclusion of all basis functions (denoted $\gamma = 0_H$) to inclusion
of all basis functions (denoted $\gamma = 1_H$).

To complete a Bayesian specification, we require a prior over the model space $\Gamma$ as well
as a prior for the nonzero coefficients $\beta_\gamma = \{\beta_h : \gamma_h = 1\}$ in each model $\gamma$. A simple prior
specification relies on embedding all of the models in the list $\Gamma$ in the full model by letting


$$\beta_h \sim \pi_h\delta_0 + (1 - \pi_h)N(0, \kappa_h^{-1}\sigma^2), \quad \sigma^2 \sim \text{Inv-gamma}(a, b), \qquad (20.3)$$


with $\delta_0$ denoting a degenerate distribution with all its mass at zero. Prior (20.3) sets $\beta_h = 0$
with probability $\pi_h$, and otherwise draws a nonzero coefficient from a $N(0, \kappa_h^{-1}\sigma^2)$ prior.
This implies $\gamma_h \sim \text{Bernoulli}(1 - \pi_h)$ independently for $h = 1, \ldots, H$ and $\beta_\gamma \sim N_{p_\gamma}(0, V_\gamma\sigma^2)$,
with $p_\gamma = \sum_h \gamma_h$ the number of basis functions in model $\gamma$ and $V_\gamma = \text{diag}(\kappa_h^{-1} : \gamma_h = 1)$.
Model (20.3) can be called a variable selection mixture prior.

In the absence of prior knowledge that certain basis coefficients are more likely to be
included, one can let $\pi_h = \pi$ and then choose a hyperprior $\pi \sim \text{Beta}(a_\pi, b_\pi)$ to allow the
data to inform more strongly about the model size. Such a prior also induces an automatic
Bayesian multiplicity adjustment, which leads to an increasing tendency to set coefficients
to zero the more unnecessary basis coefficients are added. This adjustment is clear from the
full conditional posterior distribution for $\pi$, which has the simple form
$\pi|- \sim \text{Beta}(a_\pi + \sum_h (1 - \gamma_h), b_\pi + \sum_h \gamma_h)$. To induce a heavy-tailed Cauchy prior for the
coefficients for the basis functions that are included, let $\kappa_h \sim \text{Gamma}(0.5, 0.5)$ independently
for $h = 1, \ldots, H$. This relies on the expression of the $t$ distribution as a scale mixture of
normal densities, with an inverse-gamma mixing prior on the variance. An improper prior should
not be chosen for the nonzero regression coefficients, as this leads to high posterior probability
on the null model excluding all the basis functions (see Section 7.4). However, an improper
noninformative prior can be chosen for the variance by letting $a, b \to 0$, as $\sigma$ is a parameter
common to all the possible models.
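
The multiplicity adjustment can be seen directly from the Beta full conditional. The sketch
below is illustrative only: the function name `draw_pi` and the choice $a_\pi = b_\pi = 1$ are our
assumptions, and the inclusion vectors are made up rather than taken from a fitted model.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_pi(gamma, a_pi=1.0, b_pi=1.0):
    """Draw pi (the prior exclusion probability) from Beta(a_pi + #excluded, b_pi + #included)."""
    included = int(np.sum(gamma))
    excluded = gamma.size - included
    return rng.beta(a_pi + excluded, b_pi + included)

# The same three included bases, with more and more unnecessary ones appended.
for H in (5, 20, 100):
    gamma = np.zeros(H, dtype=int)
    gamma[:3] = 1
    draws = np.array([draw_pi(gamma) for _ in range(4000)])
    # With a_pi = b_pi = 1 the conditional mean is (1 + H - 3) / (2 + H).
    print(H, round(draws.mean(), 3))
```

As excluded bases accumulate, the draws of $\pi$ move toward 1, so each remaining coefficient
faces a higher prior probability of being set to zero.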

A convenient characteristic of the above prior specification is that, assuming fixed $\pi$ and
$\kappa_h = \kappa$ for simplicity, the full joint posterior distribution is conjugate with the posterior
model probabilities available analytically as


$$\Pr(\gamma|y, X) = \frac{\pi^{H - p_\gamma}(1 - \pi)^{p_\gamma}\,p(y|X, \gamma)}{\sum_{\gamma^* \in \Gamma}\pi^{H - p_{\gamma^*}}(1 - \pi)^{p_{\gamma^*}}\,p(y|X, \gamma^*)}, \quad \text{for all } \gamma \in \Gamma, \qquad (20.4)$$


where $p(y|X, \gamma)$ is the marginal likelihood of the data under model $\gamma$,


$$p(y|X, \gamma) = \int \prod_{i=1}^n N(y_i|w_{i,\gamma}\beta_\gamma, \sigma^2)\,N(\beta_\gamma|0, V_\gamma\sigma^2)\,\text{Inv-gamma}(\sigma^2|a, b)\,d\beta_\gamma\,d\sigma^2,$$


with $w_{i,\gamma} = (w_{ih} : \gamma_h = 1)$. This marginal likelihood under model $\gamma$ is simply the
marginal likelihood for a normal linear regression model under a jointly conjugate
multivariate normal-gamma prior, and hence an analytic form is available. In addition, the
posterior distribution of $\beta_\gamma, \sigma^2$ given $\gamma$ is multivariate normal-inverse-gamma.
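
For small $H$ the model probabilities can be computed by brute force. The sketch below is one
way to do so under the stated conjugate prior with a common $\kappa$; the closed form used in
`log_marginal` comes from integrating out the coefficients and then the variance, which gives a
multivariate $t$ density. The function names, simulated data, and default hyperparameters are
our assumptions.

```python
import itertools
import math
import numpy as np

def log_marginal(y, W, gamma, kappa=1.0, a=1.0, b=1.0):
    """log p(y | X, gamma) with beta_gamma ~ N(0, sigma^2 / kappa), sigma^2 ~ Inv-gamma(a, b).

    Given sigma^2, y ~ N(0, sigma^2 M) with M = I + W_gamma W_gamma' / kappa;
    integrating over sigma^2 then yields the expression returned below.
    """
    n = y.size
    Wg = W[:, np.asarray(gamma, dtype=bool)]
    M = np.eye(n) + Wg @ Wg.T / kappa
    _, logdet = np.linalg.slogdet(M)
    quad = y @ np.linalg.solve(M, y)
    return (math.lgamma(a + n / 2) - math.lgamma(a) + a * math.log(b)
            - 0.5 * n * math.log(2 * math.pi) - 0.5 * logdet
            - (a + n / 2) * math.log(b + 0.5 * quad))

def model_probabilities(y, W, pi=0.5, **kw):
    """Posterior probability of every gamma by enumerating all 2^H models (small H only)."""
    H = W.shape[1]
    models = list(itertools.product([0, 1], repeat=H))
    logp = np.array([
        (H - sum(g)) * math.log(pi) + sum(g) * math.log(1 - pi) + log_marginal(y, W, g, **kw)
        for g in models
    ])
    p = np.exp(logp - logp.max())   # subtract the maximum for numerical stability
    return models, p / p.sum()

# Simulated example: bases 1 and 3 matter, bases 2 and 4 do not.
rng = np.random.default_rng(2)
n, H = 60, 4
W = rng.normal(size=(n, H))
y = W @ np.array([2.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.5, size=n)
models, probs = model_probabilities(y, W)
best = models[int(np.argmax(probs))]
```

The enumeration visits $2^H$ models, so this direct approach is only practical when $H$ is small.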

Unfortunately, even though the posterior is available analytically, the posterior probability
of model $\gamma$ cannot be calculated unless the number of potential basis functions $H$ is
small since there is otherwise an enormous number ($2^H$) of different models to sum across
in the denominator of (20.4). For example, when $H = 50$ there are $2^{50} = 1.1 \times 10^{15}$ models
under consideration. Hence, except in low-dimensional cases, approximations must be used.
One possibility is to rely on an MCMC-based stochastic search algorithm to identify high
posterior probability models in $\Gamma$, and model-average across these models.

Another possibility is to use Gibbs sampling to update $\gamma_h$ from its Bernoulli full conditional
posterior distribution given $\gamma_{(-h)} = (\gamma_l, l \neq h)$, with


$$\Pr(\gamma_h = 1|y, X, \gamma_{(-h)}) = \left(1 + \frac{\pi}{1 - \pi}\,\frac{p(y|X, \gamma_h = 0, \gamma_{(-h)})}{p(y|X, \gamma_h = 1, \gamma_{(-h)})}\right)^{-1},$$


which can be calculated easily. One cycle of the Gibbs sampler would update $\gamma_h$ given $\gamma_{(-h)}$
for $h = 1, \ldots, H$. This would be repeated for a large number of iterations. After warm-up
to allow convergence, the samples represent posterior draws over the model space $\Gamma$.
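
A minimal sketch of this one-at-a-time Gibbs sampler follows, assuming fixed $\pi$, a common
$\kappa$, and the closed-form marginal likelihood under the conjugate prior. The function names,
the simulated data, and the short run length are our choices for illustration, not a tuned
implementation.

```python
import math
import numpy as np

def log_marginal(y, W, gamma, kappa=1.0, a=1.0, b=1.0):
    # Closed form from integrating out beta_gamma and sigma^2 under the conjugate prior.
    n = y.size
    Wg = W[:, gamma.astype(bool)]
    M = np.eye(n) + Wg @ Wg.T / kappa
    quad = y @ np.linalg.solve(M, y)
    return (math.lgamma(a + n / 2) - math.lgamma(a) + a * math.log(b)
            - 0.5 * n * math.log(2 * math.pi) - 0.5 * np.linalg.slogdet(M)[1]
            - (a + n / 2) * math.log(b + 0.5 * quad))

def gibbs_gamma(y, W, n_iter=300, pi=0.5, seed=0):
    """One-at-a-time Gibbs updates of the inclusion indicators gamma_h."""
    rng = np.random.default_rng(seed)
    H = W.shape[1]
    gamma = np.ones(H, dtype=int)
    draws = np.empty((n_iter, H), dtype=int)
    for t in range(n_iter):
        for h in range(H):
            g0, g1 = gamma.copy(), gamma.copy()
            g0[h], g1[h] = 0, 1
            # log of [pi / (1 - pi)] * p(y | gamma_h = 0, rest) / p(y | gamma_h = 1, rest)
            log_ratio = (math.log(pi) - math.log(1 - pi)
                         + log_marginal(y, W, g0) - log_marginal(y, W, g1))
            prob_one = 1.0 / (1.0 + math.exp(min(log_ratio, 700.0)))  # cap avoids overflow
            gamma[h] = int(rng.uniform() < prob_one)
        draws[t] = gamma
    return draws

# Simulated example: only bases 0 and 3 have nonzero coefficients.
rng = np.random.default_rng(10)
n, H = 80, 8
W = rng.normal(size=(n, H))
beta_true = np.zeros(H)
beta_true[[0, 3]] = [3.0, -3.0]
y = W @ beta_true + rng.normal(size=n)
draws = gibbs_gamma(y, W)
inclusion = draws[100:].mean(axis=0)   # marginal inclusion frequencies after warm-up
```

Each update needs two marginal likelihood evaluations, so the cost per cycle grows linearly
in $H$ rather than with the $2^H$ size of the model space.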

From these draws, one can potentially conduct model selection in order to obtain a
simplified model that discards unnecessary basis functions. Under a 0-1 loss function, which
assigns a loss of 1 if the incorrect model is selected and 0 otherwise, the Bayes optimal model
corresponds to the $\gamma$ having the highest posterior probability. Unfortunately, unless $H$ is
small, it tends to be the case that there is a large number of models having similar posterior
model probabilities to the highest posterior probability model, so that it is misleading to
base inferences on any selected model. For this reason, model averaging across the posterior
on $\gamma$ is preferred to better represent uncertainty in basis selection in estimating posterior
summaries of the regression function $\mu$ and in conducting predictions. If there is interest in
selecting a single model, then a better alternative to the maximum posterior model may be
the median probability model that includes all predictors (basis functions) having marginal
inclusion probabilities $\Pr(\gamma_h = 1|\text{data}) > 0.5$. This model provides the best single model
approximation to Bayes model averaging for orthogonal basis functions.
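
Given posterior draws of $\gamma$, the median probability model is a one-line summary. The toy
draws and the function name below are hypothetical and serve only to show the computation.

```python
import numpy as np

def median_probability_model(gamma_draws):
    """Return marginal inclusion probabilities and the 0/1 vector of bases with probability > 0.5."""
    incl = np.asarray(gamma_draws).mean(axis=0)
    return incl, (incl > 0.5).astype(int)

# Toy posterior draws over three bases (rows are MCMC iterations).
draws = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [1, 0, 1],
                  [0, 1, 1],
                  [1, 0, 0]])
incl, mpm = median_probability_model(draws)
```

Here the inclusion frequencies are 0.8, 0.4, and 0.6, so the median probability model keeps
the first and third bases.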


Example. Chloride concentration (continued) 

We repeated the above analysis of the chloride data using Bayesian variable selection
to account for uncertainty in the B-spline basis functions that are needed to characterize
the curve. When all 21 basis functions were included and a weakly informative $N(0, I)$
or $N(0, 2^2I)$ prior was used for the basis coefficients, we obtained an extremely poor
fit, with the posterior mean curve dramatically shifted downwards away from the data
and towards the horizontal line at zero where the prior is centered. We considered the
model with all 21 basis functions as the full model, and assigned each basis function
a prior inclusion probability of 0.5. The coefficients for the basis functions that are
included were given independent $N(0, 2^2)$ priors, while the residual variance $\sigma^2$ was
independently assigned an Inv-gamma(1,1) prior. With this specification, we implemented
a Gibbs sampling algorithm to sample from the full conditional distributions
of each of the $\beta_h$'s and $\sigma$, with a different subset of the $\beta_h$'s automatically assigned
to zero at each iteration. We ran the Gibbs sampler to approximate convergence; all
of this took only a few seconds in R. The posterior mean for the number of included
basis functions is 12.0 with a 95% posterior interval of [8.0, 16.0]. The posterior mean
of the residual standard deviation is $\hat{\sigma} = 0.27$ with 95% interval [0.23, 0.33], suggesting
that the measurement error variance is small.


A potential drawback to using Bayesian variable selection to account for uncertainty 
in selection from among a prespecified collection of basis functions is that there may be 
some sensitivity to the initial choice of basis. For example, using H =21 prespecified cubic 
B-splines conveys some implicit prior information that the curve is quite smooth, and there 
are not sharp changes and spikes; in many applications, this is well justified but when spike 
functions are expected a priori one may want to use wavelets or another choice of basis. One 
can potentially include multiple types of basis functions in the initial collection of potential 
basis functions, with Bayesian variable selection used to select the subset of basis functions 
doing the best job at parsimoniously characterizing the curve. However, whenever possible 
prior information should strongly inform basis choice as well as the choice of prior on the 
coefficients. 


Shrinkage priors 


Allowing basis functions to drop out of the model adaptively by allowing their coefficients to 
be zero with positive probability is conceptually appealing, but comes with a computational 
price. When the number of models $2^H$ in $\Gamma$ is enormous, MCMC algorithms cannot realistically
be said to converge in that only a small percentage of the models will be visited even
in several hundreds of thousands of iterations. In addition, there can be slow mixing due to
the one-at-a-time updating of the elements of $\gamma$. Although block updating is possible, the
size of the blocks is limited due to computational constraints. In addition, when nonconju- 
gate priors are used efficient computation becomes even more difficult. These problems did 
not arise in the applications to the chloride data; indeed the computation time was much 
less than a minute for enough MCMC iterations that the mixing was excellent in every case 
we considered. Nonetheless, issues may arise in considering extensions to accommodate 
multiple predictors. 

One possible solution to this problem, which also has philosophical appeal, is to avoid 
setting any of the coefficients equal to exactly zero but instead use a regularization or 
shrinkage prior such as discussed in Section 14.6. An appropriate prior would have high 
density at zero, corresponding to basis functions that can be effectively excluded as their 
coefficients are close to zero, while having heavy tails to avoid over-shrinking the signal. 
Most useful shrinkage priors can be expressed as scale mixtures of Gaussians as follows: 


$$\beta_h \sim N(0, \sigma_h^2), \quad \sigma_h^2 \sim G,$$


with $G$ corresponding to a mixture distribution for the variances. For example, one can obtain
a $t$ distribution centered at 0 with $\nu$ degrees of freedom by setting $G = \text{Inv-gamma}(\frac{\nu}{2}, \frac{\nu}{2})$.
In the machine learning literature, a common prior for shrinkage of basis coefficients in
nonparametric regression corresponds to letting the degrees of freedom $\nu$ in the $t$ distribution
approach 0. In this limiting case, one obtains a normal-Jeffreys prior. Although the posterior
is improper and hence Bayesian inferences are meaningless, the resulting posterior
mode $\hat{\sigma} = (\hat{\sigma}_1, \ldots, \hat{\sigma}_H)$ can contain values $\hat{\sigma}_h = 0$, and the resulting empirical Bayes
posterior for $\beta_h$ is concentrated at zero. This induces a type of basis function selection, though
uncertainty in selection and estimation of the coefficients is not accommodated.

To obtain a proper posterior and accommodate uncertainty, a common approach is to 
instead choose v equal to a small nonzero value, such as v = 1078. For v > 0, the posterior 
mode will not be exactly zero but the posterior for $\beta_h$ can still be concentrated at zero for
unnecessary basis functions as long as the number of degrees of freedom is sufficiently small.
The commonly used default of $\nu = 1$, corresponding to a Cauchy prior, often yields good
performance in estimating the function $\mu$ and performing predictions.
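
The scale mixture representation of the $t$ prior can be checked by simulation. The sketch
below assumes unit scale and $\nu = 5$ purely for illustration; it compares draws built from
the inverse-gamma mixture with direct $t$ draws through a tail frequency.

```python
import numpy as np

rng = np.random.default_rng(3)
nu, n_draws = 5.0, 200_000

# sigma_h^2 ~ Inv-gamma(nu/2, nu/2): reciprocal of a Gamma(shape nu/2, rate nu/2) draw.
sigma2 = 1.0 / rng.gamma(shape=nu / 2, scale=2.0 / nu, size=n_draws)
beta_mix = rng.normal(0.0, np.sqrt(sigma2))

# Direct draws from the t distribution with nu degrees of freedom, for comparison.
beta_t = rng.standard_t(nu, size=n_draws)

tail_mix = np.mean(np.abs(beta_mix) > 2.0)
tail_t = np.mean(np.abs(beta_t) > 2.0)
tail_normal = np.mean(np.abs(rng.normal(size=n_draws)) > 2.0)
```

The two $t$ tail frequencies agree up to Monte Carlo error and both exceed the normal one,
which is the heavy-tail property the shrinkage prior relies on.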

The class of scale mixture of normal distributions also includes the Laplace (double
exponential) prior that is related to the lasso method (Section 14.6). The Laplace prior
induces sparsity in the posterior mode, in that $\beta_h$ can be exactly equal to zero, and it is
the prior having heaviest tails which still produces a computationally convenient unimodal
posterior density (assuming also log-concave likelihood). However, none of the draws from
the posterior distribution will be equal to zero and in many cases the Laplace prior does
not have heavy enough tails to avoid overshrinking the nonzero coefficients.
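
Both properties of the Laplace prior are visible in the simplest case. The sketch below
assumes a single normal observation $z \sim N(\beta, \sigma^2)$ per coefficient (our simplification,
corresponding to an orthonormal basis) and a prior density proportional to $\exp(-\lambda|\beta|)$;
the function name is ours.

```python
import numpy as np

def laplace_posterior_mode(z, lam, sigma2=1.0):
    """Posterior mode of beta for z ~ N(beta, sigma2) and p(beta) proportional to exp(-lam |beta|).

    Minimizing (z - beta)^2 / (2 sigma2) + lam * |beta| gives soft thresholding at lam * sigma2.
    """
    return np.sign(z) * np.maximum(np.abs(z) - lam * sigma2, 0.0)

z = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])
mode = laplace_posterior_mode(z, lam=1.0)
```

Small inputs are set exactly to zero, while the large ones are all pulled toward zero by the
same fixed amount regardless of their size, which is the overshrinkage noted above.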

An alternative is to use a generalized double Pareto prior distribution on the regression
coefficients, which resembles the double exponential near the origin while having arbitrarily
heavy tails. The density has the form


$$\text{gdP}(\beta|\xi, \alpha) = \frac{1}{2\xi}\left(1 + \frac{|\beta|}{\alpha\xi}\right)^{-(\alpha+1)},$$


where $\xi > 0$ is a scale parameter and $\alpha > 0$ is a shape parameter. One can sample from
the generalized double Pareto by instead drawing $\beta \sim N(0, \sigma^2)$, $\sigma^2 \sim \text{Expon}(\lambda^2/2)$, and
$\lambda \sim \text{Gamma}(\alpha, \eta)$ where $\xi = \eta/\alpha$. Hence the generalized double Pareto also admits a
scale mixture of normals representation and thus retains the computational convenience
associated with such mixtures. As a typical default specification for the
hyperparameters, one can let $\alpha = \eta = 1$, which leads to Cauchy-like tails. Using the generalized
double Pareto density as a shrinkage prior on the basis coefficients in nonparametric
regression, we let


$$p(\beta|\sigma) = \prod_{h=1}^H \frac{\alpha}{2\sigma\eta}\left(1 + \frac{|\beta_h|}{\sigma\eta}\right)^{-(\alpha+1)},$$


which is equivalent to $\beta_h \sim N(0, \sigma^2\tau_h)$, with $\tau_h \sim \text{Expon}(\lambda_h^2/2)$ and $\lambda_h \sim \text{Gamma}(\alpha, \eta)$.
Placing the prior $p(\sigma) \propto 1/\sigma$ on the error variance, we then obtain a simple block Gibbs
sampler having the following conditional posterior distributions:


$$\begin{aligned}
\beta|- &\sim N\big((W^TW + T^{-1})^{-1}W^Ty,\ \sigma^2(W^TW + T^{-1})^{-1}\big) \\
\sigma^2|- &\sim \text{Inv-gamma}\big((n + H)/2,\ (y - W\beta)^T(y - W\beta)/2 + \beta^TT^{-1}\beta/2\big) \\
\lambda_h|- &\sim \text{Gamma}(\alpha + 1,\ |\beta_h|/\sigma + \eta) \\
\tau_h^{-1}|- &\sim \text{Inv-Gaussian}(\mu = |\lambda_h\sigma/\beta_h|,\ \rho = \lambda_h^2)
\end{aligned}$$


where $W = (w_1, \ldots, w_n)^T$ and $T = \text{Diag}(\tau_1, \ldots, \tau_H)$.
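
The mixture representation of the generalized double Pareto can be verified by simulation.
The sketch below uses the default hyperparameter values of 1 for both the shape and the rate,
takes the error scale equal to 1, and compares empirical tail frequencies with the tail
probability obtained by integrating the density; the variable and function names are ours.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, eta, n_draws = 1.0, 1.0, 200_000
xi = eta / alpha

lam = rng.gamma(shape=alpha, scale=1.0 / eta, size=n_draws)   # lambda ~ Gamma(alpha, rate eta)
tau = rng.exponential(scale=2.0 / lam**2)                      # tau ~ Expon(rate lambda^2 / 2)
beta = rng.normal(0.0, np.sqrt(tau))                           # beta | tau ~ N(0, tau)

def gdp_tail(t, xi, alpha):
    """Pr(|beta| > t) from integrating (1 / (2 xi)) (1 + |beta| / (alpha xi))^-(alpha + 1)."""
    return (1.0 + t / (alpha * xi)) ** (-alpha)

cutoffs = (0.5, 1.0, 5.0)
empirical = np.array([np.mean(np.abs(beta) > t) for t in cutoffs])
exact = np.array([gdp_tail(t, xi, alpha) for t in cutoffs])
```

The tail probability decays only polynomially in the cutoff, in contrast with the
exponential decay under a Laplace prior.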

This Gibbs sampler tends to have good convergence and mixing properties in our experience,
perhaps due largely to the block updating of $\beta$. After convergence, one can obtain
draws from the posterior distribution for the nonparametric regression curve $\mu(x)$, which
is expressed as a linear combination of basis functions, with the coefficients on the basis 
functions shrunk towards zero via the generalized double Pareto prior. In high-dimensional 
settings involving large numbers of potential basis functions, the tendency will be to set 
the coefficients for many of these bases close to zero while not shrinking the coefficients 
for the more important bases much at all. For nonorthogonal bases in which there is some 
redundancy, the specific bases having coefficients away from a small neighborhood of zero 
may vary across the iterations. 
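
A compact sketch of the block Gibbs sampler for the generalized double Pareto prior is given
below. It follows the four conditional distributions listed earlier, with the number of basis
coefficients written as `H`. The numerical floor and ceiling on small and large quantities, the
starting values, the run length, and the simulated data are our additions, so this should be
read as an illustration rather than a reference implementation.

```python
import numpy as np

def gdp_gibbs(y, W, alpha=1.0, eta=1.0, n_iter=2000, seed=0):
    """Block Gibbs sampler for y ~ N(W beta, sigma^2 I) with a generalized double Pareto prior."""
    rng = np.random.default_rng(seed)
    n, H = W.shape
    WtW, Wty = W.T @ W, W.T @ y
    tau_inv = np.ones(H)                 # diagonal of T^{-1}
    sigma2 = 1.0
    draws = np.empty((n_iter, H))
    for t in range(n_iter):
        # beta | - ~ N((W'W + T^{-1})^{-1} W'y, sigma^2 (W'W + T^{-1})^{-1})
        cov = np.linalg.inv(WtW + np.diag(tau_inv))
        L = np.linalg.cholesky(0.5 * (cov + cov.T))
        beta = cov @ Wty + np.sqrt(sigma2) * (L @ rng.standard_normal(H))
        # sigma^2 | - ~ Inv-gamma((n + H)/2, (y - W beta)'(y - W beta)/2 + beta' T^{-1} beta/2)
        resid = y - W @ beta
        scale = 0.5 * (resid @ resid) + 0.5 * np.sum(tau_inv * beta**2)
        sigma2 = scale / rng.gamma(shape=0.5 * (n + H))
        sigma = np.sqrt(sigma2)
        abs_beta = np.maximum(np.abs(beta), 1e-10)   # numerical floor (our addition)
        # lambda_h | - ~ Gamma(alpha + 1, rate |beta_h| / sigma + eta)
        lam = rng.gamma(shape=alpha + 1.0, scale=1.0 / (abs_beta / sigma + eta))
        # 1 / tau_h | - ~ Inv-Gaussian(mean |lambda_h sigma / beta_h|, shape lambda_h^2)
        tau_inv = np.clip(rng.wald(lam * sigma / abs_beta, lam**2), 1e-10, 1e10)
        draws[t] = beta
    return draws

# Simulated sparse regression: two nonzero coefficients out of ten.
rng = np.random.default_rng(11)
n, H = 100, 10
W = rng.normal(size=(n, H))
beta_true = np.zeros(H)
beta_true[:2] = [3.0, -2.0]
y = W @ beta_true + rng.normal(scale=0.5, size=n)
post_mean = gdp_gibbs(y, W)[500:].mean(axis=0)   # posterior means after warm-up
```

In this simulated example the posterior means of the two large coefficients stay close to
their true values while the remaining ones are pulled toward zero without being set exactly
to zero.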


20.3 Non-normal models and regression surfaces 
Other error distributions 


The above discussion has focused on continuous response variables $y_i$ with Gaussian distributed
residuals. It is straightforward to modify the methods to accommodate heavier-tailed
residual densities that allow outliers by instead using a scale mixture of normals. In
particular, we could let


$$y_i \sim N(\mu(x_i), \phi_i\sigma^2), \quad \phi_i \sim \text{Inv-gamma}(\nu/2, \nu/2),$$


which induces a $t_\nu$ distribution for the residual density. For low $\nu$, the $t$ density is substantially
heavier-tailed than the normal density, automatically downweighting the influence of
outliers on the posterior distribution of $\mu(x)$ without needing to discard outlying points.
The inverse-gamma scale mixture of normals representation of the residual density is highly
convenient in terms of posterior computation; we can simply modify the MCMC code developed
in the Gaussian residual case to include an additional step for sampling from the
inverse-gamma conditional posterior distribution of $\phi_i$ while also modifying the other sampling
steps to replace $\sigma^2$ with $\phi_i\sigma^2$. We can additionally include a Metropolis-Hastings step
to allow unknown degrees of freedom $\nu$, or simply fix it in advance at an elicited value.
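
The additional sampling step is short. Combining the normal likelihood with the inverse-gamma
prior gives an inverse-gamma conditional with shape $(\nu + 1)/2$ and scale
$(\nu + r_i^2/\sigma^2)/2$, where $r_i$ is the current residual; this derivation, the function name,
and the toy residuals below are ours.

```python
import numpy as np

def draw_phi(resid, sigma2, nu, rng):
    """phi_i | - ~ Inv-gamma((nu + 1)/2, (nu + resid_i^2 / sigma2)/2), independently over i."""
    shape = 0.5 * (nu + 1.0)
    scale = 0.5 * (nu + resid**2 / sigma2)
    return scale / rng.gamma(shape=shape, size=resid.shape)

rng = np.random.default_rng(5)
resid = np.array([0.1, -0.2, 0.05, 3.0])      # the last residual plays the role of an outlier
phi = np.array([draw_phi(resid, sigma2=0.27**2, nu=4.0, rng=rng) for _ in range(5000)])
weights = (1.0 / phi).mean(axis=0)            # average precision weight of each observation
```

The outlying observation receives a much smaller average weight, which is how the $t$ model
limits its influence on the fitted curve.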


Example. Chloride concentration (continued) 

To assess robustness to outliers, we randomly contaminated one of the observations in
the chloride data by adding a normal random variable having 10 times the standard
deviation of the residual estimated in the above analysis; the contaminated observation
was $y_{47} = 32.4$. Rerunning the Gibbs sampler that accounts for uncertainty
in basis function selection while assuming Gaussian residuals, the posterior mean of
$\sigma$ increased from $\hat{\sigma} = 0.27$ to $\hat{\sigma} = 0.61$, and the posterior intervals around the curve
were substantially wider, but the estimate did not change appreciably. However,
repeating this exercise using 100 times the standard deviation to obtain $y_{49} = 46.3$, the
results were poor with $\hat{\sigma} = 22.4$ and the estimated curve pulled dramatically towards
the horizontal line at zero and away from the data. Repeating the analysis allowing
$t$ residuals with degrees of freedom fixed at 4, we obtained results that were quite close
to the results for the Gaussian model applied to uncontaminated data, with the curve
estimate only slightly pulled up and posterior intervals only slightly wider.


One can allow outcomes in the exponential family by relying on the same framework as 
above but with $\eta_i = w_i\beta$ as the linear predictor in a generalized linear model (see Chapter
16). In non-Gaussian cases, the marginal likelihoods needed for posterior computation 
will not in general be available analytically, and it is common in practice to rely on Laplace 
approximations. An alternative strategy that can be used for categorical responses in probit 
models is to rely on data augmentation incorporating a latent Gaussian continuous response, 
with augmented data marginal likelihoods available in closed form for the latent variables. 
Such an approach would add a step to the MCMC algorithm for sampling the underlying 
Gaussian variables from their full conditional posterior distributions. 
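
The extra data augmentation step for a probit model can be sketched as follows. Each latent
variable is drawn from a unit-variance normal centered at the linear predictor and truncated
to the positive or nonpositive half line according to the observed response. The inverse-CDF
approach, the function name, and the toy inputs are our choices.

```python
from statistics import NormalDist
import numpy as np

_STD_NORMAL = NormalDist()

def draw_latent(y, eta, rng):
    """Draw z_i ~ N(eta_i, 1) truncated to z_i > 0 if y_i = 1 and z_i <= 0 if y_i = 0."""
    z = np.empty(len(y))
    for i, (yi, ei) in enumerate(zip(y, eta)):
        p0 = _STD_NORMAL.cdf(-ei)                 # Pr(z_i <= 0) before truncation
        lo, hi = (p0, 1.0) if yi == 1 else (0.0, p0)
        u = min(max(rng.uniform(lo, hi), 1e-12), 1.0 - 1e-12)
        z[i] = ei + _STD_NORMAL.inv_cdf(u)        # inverse-CDF draw of the truncated error
    return z

rng = np.random.default_rng(6)
y = np.array([1, 0, 1, 1, 0])
eta = np.array([0.3, -1.2, 2.0, -0.5, 0.8])      # linear predictor at the current coefficients
z = draw_latent(y, eta, rng)
```

Given the latent draws, the remaining updates proceed as in the Gaussian case with the latent
variables in place of the responses and the residual variance fixed at 1.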


Multivariate regression surfaces 


Until this point, we have focused on regression models with a single predictor. In consider- 
ing generalizations to accommodate multiple predictors, one must keep in mind the curse 
of dimensionality. This curse can arise in two ways. Firstly, computational methods that 
work for a single predictor or a small number of predictors may not scale well as predictors 
are added. This is certainly the case if one attempts to prespecify sufficiently many po- 
tential basis functions to characterize an unconstrained multivariate regression surface, and 
then rely on Bayesian variable selection or shrinkage to effectively remove the unnecessary 
bases. For more than a few predictors, the number of basis functions needed may increase 
significantly and rapidly become prohibitive. A second problem for multiple predictors is 
that, even putting aside any computational issues, it may be necessary to have enormous 
amounts of data to reliably estimate a multivariate regression surface without parametric 
assumptions, substantial prior information or some restrictions. As the number of predictors 
p increases for a given sample size n, observations become much more sparsely distributed 
across the domain of the predictors $\mathcal{X} \subset \mathbb{R}^p$ and hence there are typically subregions of $\mathcal{X}$
having few observations. 

In such settings, the choice of prior for $\mu$ is crucial in developing Bayesian approaches for
producing accurate interpolations across sparse data regions. One commonly used approach
is to assume additivity so that the multivariate regression surface $\mu$ mapping from the
predictor space $\mathcal{X}$ to the real numbers is characterized as a sum of univariate regression
functions as follows:


$$\mu(x) = \mu_0 + \sum_{j=1}^p \beta_j(x_j), \quad \beta_j(x_j) = \sum_{h=1}^{H_j}\theta_{jh}b_{jh}(x_j), \qquad (20.5)$$


where $\beta_j(\cdot)$ is an unknown coefficient function for the $j$th predictor, which is expressed as
a linear combination of a prespecified set of basis functions $b_j = \{b_{jh}\}_{h=1}^{H_j}$. For example,
$b_j$ may correspond to B-splines as above. Focusing on the additive model case, Bayesian
variable selection or shrinkage priors can be applied exactly as described above in the
$p = 1$ case without complications. In particular, MCMC algorithms permit a divide and
conquer approach using a modified response variable $y^*$ that subtracts the contributions of
the intercept and other predictors in updating the unknowns characterizing the regression
function for the $j$th predictor.
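
The modified response is simple to construct. In the sketch below, the basis matrices,
coefficients, and function name are hypothetical, and the data are generated without noise so
that the result can be checked exactly.

```python
import numpy as np

def partial_residual(y, mu0, W_list, theta_list, j):
    """y* for predictor j: subtract the intercept and every other predictor's fitted function."""
    others = sum(W_list[k] @ theta_list[k] for k in range(len(W_list)) if k != j)
    return y - mu0 - others

rng = np.random.default_rng(7)
n = 50
W_list = [rng.normal(size=(n, 3)), rng.normal(size=(n, 4))]   # hypothetical basis matrices
theta_list = [np.array([1.0, -1.0, 0.5]), np.array([0.0, 2.0, 0.0, -1.0])]
mu0 = 0.7
y = mu0 + W_list[0] @ theta_list[0] + W_list[1] @ theta_list[1]   # noiseless, for checking

y_star = partial_residual(y, mu0, W_list, theta_list, j=0)
theta0_hat = np.linalg.lstsq(W_list[0], y_star, rcond=None)[0]
```

With the other components held at their current values, the update for the first predictor
is an ordinary single-predictor problem with the modified response in place of the original one.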

Additive models can often reduce the curse of dimensionality, and efficiency can be 
further improved by including prior information. In many applications, prior information 
takes the form of shape constraints on the regression function. For example, it may be 
reasonable to assume a priori that the mean of the response variable is nondecreasing 
in one or more of the predictors, leading to a nondecreasing constraint on certain $\beta_j(x_j)$
functions in the additive expansion. Such constraints are easy to incorporate within a
Bayesian approach by using piecewise linear or monotone splines $b_j$ and then constraining
the regression coefficients to be nonnegative. One way to encourage sparse models is by
using a prior distribution that is a mixture of a point mass at zero and a truncated normal
distribution on the $\theta_{jh}$'s, leading to nondecreasing $\beta_j$ functions that can be flat across regions
of the predictor space. Allowing flat functions serves the dual purpose of reducing bias in 
overestimating the slope of the regression function and permitting inferences on regions 
across which the predictor has no impact. As the Bayesian approach leads to uncertainty in 
the locations of these flat regions, one can use such models to estimate posterior distributions 
of threshold predictor levels corresponding to the first value such that there is an increase 
in the response mean. More involved shape restrictions, such as unimodality and convexity, 
can also be incorporated through an appropriate prior. 


Example. A nonparametric regression function that is constrained to be 
nondecreasing 

Data from pregnant women in the U.S. Collaborative Perinatal Project were used 
to study the impact of DDE, a persistent metabolite of the pesticide DDT, on the 
risk of premature delivery. Out of 2380 pregnancies in the dataset, there were 361 
preterm births. Serum DDE concentration in mg/L was measured for each woman, 
along with potentially confounding maternal characteristics including cholesterol and 
triglyceride levels, age, BMI and smoking status (yes or no). The aim of our analysis 
was to incorporate a nondecreasing constraint on the regression function relating level 
of DDE to the probability of preterm birth in order to improve efficiency in assessing 
the dose response trend. As for other potentially adverse environmental exposures, it 
was reasonable to believe a priori that covariate-adjusted premature delivery risk is 
nondecreasing in dose of DDE. Without restricting the curve to be nondecreasing, one 
relies too much on the data and may obtain artifactual bumps that are not believable. 
However, we also do not want to impose a strictly increasing relationship a priori, as 
there may be no impact on risk at low doses or even across the whole range observed
in the study. 

We focused on the following semiparametric probit additive model: 


$$\Pr(y_i = 1|\theta, x_i, z_i) = \Phi\Big(\alpha_0 + \sum_{l=1}^5 z_{il}\alpha_l + f(x_i)\Big) = \Phi(z_i\alpha + f(x_i)),$$


where $y_i$ is an indicator of preterm birth, $x_i$ is DDE level, $z_i = (1, z_{i2}, \ldots, z_{i5})$ is a
vector of the five predictors in the order listed above, and $\Phi(\cdot)$ is the standard normal
cumulative distribution function. The covariate adjustment is parametric, while
$f(x)$ is characterized nonparametrically as a nondecreasing but potentially flat curve
using splines with a carefully structured prior on the basis coefficients. We chose diffuse
$N(0, 10^2)$ priors independently for the $\alpha$ values, though a more informative prior
could easily be elicited in this application, as premature delivery studies are routinely
conducted, with similar covariates measured in different studies. For $f(x)$, we simply
used a piecewise linear function with a dense set of knots and $\beta_j$ representing the slope
within the $j$th interval. By choosing a prior for the $\beta_j$'s that does not allow negative
values, we enforce the nondecreasing constraint.

Figure 20.4 Estimated probability of preterm birth as a function of DDE dose. The solid line is
the posterior mean based on a Bayesian nonparametric regression constrained to be nondecreasing,
and the dashed lines are 95% posterior intervals for the probability at each point. The dotted line
is the maximum likelihood estimate for the unconstrained generalized additive model.


In order to borrow information across the adjacent intervals, while enforcing the constraint
and placing positive probability at $\beta_j = 0$ to accommodate flat regions, we use
a latent threshold prior. In particular, we defined a first-order normal random walk
autoregressive prior for latent slope parameters, $\beta_j^* \sim N(\beta_{j-1}^*, \sigma_\beta^2)$, with $\sigma_\beta^2$ assigned
an inverse-gamma hyperprior to allow the data to inform about the level of smoothness.
Then, to link the latent $\beta_j^*$'s to the slopes characterizing the function, we let
$\beta_j = 1_{\beta_j^* > \delta}\,\beta_j^*$, with $\delta$ a small positive threshold parameter, which is assigned a gamma
hyperprior. As $\delta$ increases, it becomes more likely to sample $\beta_j = 0$ and the resulting
function has more flat regions.
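
A draw from this latent threshold prior, and the nondecreasing piecewise linear function it
implies, can be sketched as follows. The knot grid, the starting value of zero for the random
walk, the hyperparameter values, and the function names are our assumptions; the example
samples from the prior only and does not use the DDE data.

```python
import numpy as np

def draw_slopes(n_intervals, sigma_beta, delta, rng):
    """Random-walk latent slopes, thresholded: slope = latent if latent > delta, else 0."""
    latent = np.cumsum(rng.normal(0.0, sigma_beta, size=n_intervals))  # walk started at 0
    return np.where(latent > delta, latent, 0.0)

def piecewise_linear(x, knots, slopes):
    """f(x) = sum over intervals of slope times the part of the interval lying below x."""
    widths = np.clip(x[:, None] - knots[None, :-1], 0.0, np.diff(knots)[None, :])
    return widths @ slopes

rng = np.random.default_rng(8)
knots = np.linspace(0.0, 1.0, 21)                 # 20 intervals (illustrative)
slopes = draw_slopes(20, sigma_beta=0.5, delta=0.1, rng=rng)
x = np.linspace(0.0, 1.0, 200)
f = piecewise_linear(x, knots, slopes)
```

Every sampled slope is either exactly zero or larger than the threshold, so the function is
nondecreasing with flat stretches wherever the latent walk falls at or below the threshold.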

We implemented an MCMC algorithm for posterior computation. The estimated curve 
f(x) is shown in Figure 20.4. The estimated posterior probability of the global null 
hypothesis that the curve f(x) is 0 across the observed range of DDE in the sample 
was less than 0.01, in contrast to the results obtained fitting the same model using a 
simple classical approach with no constraints, which led to a p-value of 0.23. Using 
the Bayesian posterior simulations, we also estimated the first dose level at which 
there is an increasing slope. The posterior mean for this threshold dose is 7=7 with 
a 95% interval of [3,21]. Such thresholds are of broad interest in many applications, 
but we recognize that they are approximations, given that the underlying function is 
presumably continuous and always increasing to some extent. 


Although appealing in making the class of possible multivariate regression functions 
$\mu(x)$ more manageable, the additivity assumption is clearly violated in many applications.
For example, violations of additivity arise when there are interactions among the predictors, 
so that the shape and slope of the regression function in the jth predictor depends on the 
values for other predictors. One alternative approach, which also attempts to address the 
curse of dimensionality, relies on tensor product specifications, such as 


$$\mu(x) = \sum_{h_1=1}^H \cdots \sum_{h_p=1}^H \theta_{h_1 \cdots h_p}\prod_{j=1}^p b_{jh_j}(x_j),$$


where $b_j = \{b_{jh}\}$ is a prespecified set of basis functions for the $j$th predictor, as in (20.5)
above but with $H_j = H$ for simplicity, and $\theta = (\theta_{h_1 \cdots h_p})$ is a $p$-way array (tensor) containing
unknown coefficients. The number of coefficients in the tensor $\theta$ can be large, particularly
as $p$ grows. However, we can use Bayesian variable selection or shrinkage priors to favor
many elements of $\theta$ close to zero. This enables effective collapsing on a lower-dimensional
representation of the multidimensional regression surface $\mu$. For example, one can drop out
predictors entirely or remove interactions. Conditionally on the basis functions, we have
linearity, so that efficient computation is possible using Gibbs sampling.
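
The linearity can be made concrete by writing the tensor product as a single design row. In
the sketch below the basis values are random stand-ins rather than evaluations of actual
spline functions, and the names are ours.

```python
from functools import reduce
import numpy as np

def tensor_row(basis_values):
    """Kronecker product of per-predictor basis vectors: all products across predictors."""
    return reduce(np.kron, basis_values)

H, p = 4, 3
rng = np.random.default_rng(9)
b = [rng.uniform(size=H) for _ in range(p)]       # stand-ins for the basis values at x_j
theta = rng.normal(size=(H,) * p)                 # p-way coefficient array

row = tensor_row(b)                               # length H**p
mu_flat = row @ theta.ravel()                     # mu(x) as a linear function of theta
mu_sum = np.einsum("abc,a,b,c->", theta, b[0], b[1], b[2])   # explicit triple sum
```

The design row has $H^p$ entries, which shows both why the model is linear in the coefficients
and why their number grows quickly with $p$.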


20.4 Bibliographic note 


Bishop (2006) provides a useful review of basis function models. Some key references on the 
reversible jump MCMC approach to basis function selection in nonparametric regression 
include Biller (2000) and DiMatteo, Genovese, and Kass (2001). The Bayesian variable 
selection approach to allowing uncertainty in basis selection was suggested by Smith and 
Kohn (1996). This approach applies stochastic search for posterior computation in Bayesian 
variable selection problems (George and McCulloch, 1993, 1997). Justification for median 
probability models is provided in Barbieri and Berger (2004). 

Park and Casella (2008) discuss inference using the Laplace prior distribution using 
posterior simulations, and Seeger (2008) considers expectation propagation for this model. 
References on generalized double Pareto shrinkage include Armagan, Dunson and Lee (2013) 
and Armagan et al. (2013). 

Some references on monotone Bayesian nonparametric regression include Ramsay and 
Silverman (2005), Neelon and Dunson (2004), Dunson (2005), and Hazelton and Turlach 
(2011), with Hannah and Dunson (2011) recently developing efficient methods for multivari- 
ate convex regression. Pati and Dunson (2011) use tensor product nonparametric regression 
for surface estimation. 

The chloride example comes from Bates and Watts (1988). The DDE study comes from 
Neelon and Dunson (2004). 


20.5 Exercises 


1. Basis function model: The file at naes04.csv contains age, sex, race, and attitude on 
three gay-related questions from the 2004 National Annenberg Election Survey. The three 
questions are whether the respondent favors a constitutional amendment banning same- 
sex marriage, whether the respondent supports a state law allowing same-sex marriage, 
and whether the respondent knows any gay people. Figure 20.5 shows the data for the 
latter two questions (averaged over all sex and race categories). 

For this exercise, you will only need to consider the outcome as a function of age, and 
for simplicity you should use the normal approximation to the binomial distribution for 
the proportion of Yes responses for each age. 

(a) Set up a Bayesian basis function model to estimate the percentage of people in the 
population who believe they know someone gay (in 2004), as a function of age. Write 
the model in statistical notation (all the model, including prior distribution), and 
write the (unnormalized) joint posterior density. As noted above, use a normal model 
for the data. 

(b) Program the log of the unnormalized joint posterior density as an R function. 

(c) Fit the model. You can use MCMC, variational Bayes, expectation propagation, Stan, 
or any other method. But your fit must be Bayesian. 

(d) Graph your estimate along with the data (plotting multiple graphs on a single page). 

2. Basis function model with binary data: Repeat the previous exercise but this time using 
the binomial model for the Yes/No responses. The computation will be more complicated 
but your results should be similar. Discuss any differences compared to the results from 
the previous exercise. 

Figure 20.5 Proportion of survey respondents who reported knowing someone gay, and who 
supported a law allowing same-sex marriage, as a function of age. Can you fit curves through 
these points using splines or Gaussian processes? 


3. Basis function model with multiple predictors: Repeat the previous exercise but this time 
estimating the percentage of people in the population who believe they know someone 
gay (in 2004), as a function of three predictors: age, sex, and race. 

4. Basis function model for binary data: Table 19.1 presents data on the success rate of 
putts by professional golfers. 

(a) Fit a basis function model for the probability of success (using the binomial likelihood) 
as a function of distance. Compare results to your solution of Exercise 19.2. 
(b) Use posterior predictive checks to assess the fit of the model. 

5. Hierarchical modeling and splines: The file Pollster_Data.csv gives percentage support 
for Barack Obama and Mitt Romney in a series of opinion polls in the 2012 election 
campaign. Different polls are conducted by different survey organizations using different 
modes of interviewing, with different populations and different sample sizes. Estimate a 
time series of support for each candidate, adjusting for all these factors and smoothing 
the curve using a spline model for the time pattern and a hierarchical model for polling 
organization effects and for poll-to-poll variation. Compare to the smoothed average of 
the unadjusted approval numbers from this series and comment on any differences. 

6. Consider a nonparametric regression model $y_i = \mu(x_i) + \epsilon_i$, with $x_i \in [0,1]$, 
$\mu(x) = \sum_{h=1}^{k} \beta_h b_h(x)$, $\{b_h\}$ cubic B-spline basis functions, and the basis 
coefficients $\beta_h$ drawn independently from a generalized double Pareto shrinkage prior. 

(a) For different choices of $k$, sample and plot realizations from the prior for $\mu$. 

(b) What is the prior expectation for $\mu(x)$ and how does it depend on $k$ and $x$? 

(c) What is the prior variance of $\mu(x)$ and how does it depend on $k$ and $x$? 

(d) Describe a modification of the generalized double Pareto prior to let $E(\mu(x)) \approx a$ and 
$\mathrm{var}(\mu(x)) \approx 2$ for all $x$ while maintaining the prior independence assumption in the 
$\beta_h$'s. 

Chapter 21 


Gaussian process models 


In Chapter 20, we considered basis function methods such as splines and kernel regressions, 
which typically require choice of a somewhat arbitrary set of knots. One can prespecify a grid 
of many knots and then use variable selection and shrinkage to effectively discard the knots 
that are not needed, but there may nonetheless be some sensitivity to the initial grid. A 
high-dimensional grid leads to a heavy computational burden, while a low-dimensional grid 
may not be sufficiently flexible. Another possibility, which has some distinct computational 
and theoretical advantages, is to set up a prior distribution for the regression function using 
a Gaussian process, a flexible class of models for which any finite-dimensional marginal 
distribution is Gaussian, and which can be viewed as a potentially infinite-dimensional 
generalization of the Gaussian distribution. 


21.1 Gaussian process regression 


Realizations from a Gaussian process correspond to random functions, and hence the 
Gaussian process is natural as a prior distribution for an unknown regression function $\mu(x)$, 
with multivariate predictors and interactions easily accommodated without the need to 
explicitly specify basis functions. We write a Gaussian process as $\mu \sim \mathrm{GP}(m,k)$, parametrized 
in terms of a mean function $m$ and a covariance function $k$. The Gaussian process prior 
on $\mu$ defines it as a random function (stochastic process) for which the values at any $n$ 
prespecified points $x_1,\dots,x_n$ are a draw from the $n$-dimensional normal distribution, 

$$\mu(x_1),\dots,\mu(x_n) \sim \mathrm{N}\big((m(x_1),\dots,m(x_n)),\; K(x_1,\dots,x_n)\big),$$

with mean $m$ and covariance $K$. The Gaussian process $\mu \sim \mathrm{GP}(m,k)$ is a nonparametric 
model in that there are infinitely many parameters characterizing the regression function 
$\mu(x)$ evaluated at all possible predictor values $x \in \mathcal{X}$. The mean function represents an 
initial guess at the regression function, with the linear model $m(x) = x\beta$ corresponding to 
a convenient special case, possibly with a hyperprior chosen for the regression coefficients 
$\beta$ in this mean function. Centering the Gaussian process on a linear model, while allowing 
the process to accommodate deviations from the linear model, addresses the curse of 
dimensionality, as the posterior can concentrate close to the linear model (or an alternative 
parametric mean function) to an extent supported by the data. The linear base model is 
also useful in interpolating across sparse data regions. 

The function $k$ specifies the covariance between the process at any two points, with $K$ 
an $n \times n$ covariance matrix with element $(p,q)$ corresponding to $k(x_p, x_q)$, for which we use 
shorthand notation $k(x, x')$. The covariance function controls the smoothness of realizations 
from the Gaussian process and the degree of shrinkage towards the mean. A common choice 
is the squared exponential (or exponentiated quadratic, or Gaussian) covariance function, 

$$k(x, x') = \tau^2 \exp\left(-\frac{|x - x'|^2}{2l^2}\right),$$

where $\tau$ and $l$ are unknown parameters in the covariance$^1$ and $|x - x'|^2$ is the squared 
Euclidean distance between $x$ and $x'$. Here $\tau$ controls the magnitude and $l$ the smoothness 
of the function. Figure 21.1 shows realizations from the Gaussian process prior assuming 
a squared exponential covariance function with different values of $\tau$ and $l$. 

Figure 21.1 Random draws from the Gaussian process prior with squared exponential covariance 
function and different values of the amplitude parameter $\tau$ and the length scale parameter $l$. 
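As a rough illustration of how draws like those described for Figure 21.1 could be generated, the following sketch evaluates the squared exponential covariance on a grid and multiplies a Cholesky factor by standard normal noise. The function names, the grid, the jitter term, and the values of $\tau$ and $l$ are assumptions made for this example and are not the settings used for the figure.

```python
import numpy as np

def sq_exp_cov(x1, x2, tau=1.0, length=1.0):
    """Squared exponential covariance tau^2 exp(-|x - x'|^2 / (2 l^2)) for 1-D inputs."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return tau**2 * np.exp(-0.5 * sq_dist / length**2)

def draw_gp_prior(x, tau, length, n_draws=3, jitter=1e-8, seed=0):
    """Draw n_draws zero-mean GP realizations evaluated at the points x."""
    rng = np.random.default_rng(seed)
    # A small jitter on the diagonal keeps the Cholesky factorization stable.
    K = sq_exp_cov(x, x, tau, length) + jitter * np.eye(len(x))
    L = np.linalg.cholesky(K)
    z = rng.standard_normal((len(x), n_draws))
    return (L @ z).T  # each row is one random function on the grid

x_grid = np.linspace(0.0, 10.0, 101)
# Illustrative settings only (assumed, not taken from the text).
draws = {(tau, length): draw_gp_prior(x_grid, tau, length)
         for tau in (0.25, 1.0) for length in (0.5, 2.0)}
```

Plotting each row of `draws[(tau, length)]` against `x_grid` should show larger vertical spread for larger $\tau$ and slower wiggles for larger $l$.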

Gaussian process priors are appealing in being able to fit a wide range of smooth surfaces 
while being computationally tractable even for moderate to large numbers of predictors. 
There is a connection to basis expansions. In particular, if one chooses Gaussian priors for 
the coefficients on the basis functions, then one actually obtains an induced GP prior for $\mu$ 
with a mean and covariance function that depend on the hyperparameters in the Gaussian 
prior as well as the choice of basis. To demonstrate this, let 

$$\mu(x) = \sum_{h=1}^{H} \beta_h b_h(x), \quad \beta = (\beta_1,\dots,\beta_H) \sim \mathrm{N}(\beta_0, \Sigma_\beta),$$

which is a basis function model with a multivariate normal prior on the coefficients. Then, 

$$(\mu(x_1),\dots,\mu(x_n)) \sim \mathrm{N}_n\big((m(x_1),\dots,m(x_n)),\; K(x_1,\dots,x_n)\big),$$

with mean and covariance function 

$$m(x) = b(x)^T \beta_0, \quad k(x, x') = b(x)^T \Sigma_\beta\, b(x'),$$

and $b(x) = (b_1(x),\dots,b_H(x))^T$. The relationship works the other way as well. Gaussian 
processes with typical covariance functions, such as squared exponential, have equivalent 
representations in terms of infinite basis expansions, and truncations of such expansions can 
be useful in speeding up computation. 
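The induced covariance $b(x)^T \Sigma_\beta\, b(x')$ can be checked numerically. The sketch below is only an illustration: the Gaussian-bump basis, the number of basis functions, and the values of $\beta_0$ and $\Sigma_\beta$ are arbitrary assumptions. It compares the analytic mean and covariance with Monte Carlo moments of $\mu = B\beta$.

```python
import numpy as np

def basis_matrix(x, centers, width=0.2):
    """Rows are b(x_i) for H Gaussian-bump basis functions (an arbitrary choice)."""
    x = np.asarray(x, dtype=float)
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

H = 8
centers = np.linspace(0.0, 1.0, H)
x = np.linspace(0.0, 1.0, 5)
B = basis_matrix(x, centers)            # n x H matrix of basis evaluations

beta0 = np.full(H, 0.1)                 # assumed prior mean of the coefficients
Sigma_beta = 0.5 * np.eye(H)            # assumed prior covariance of the coefficients

# Induced Gaussian process mean and covariance at the points x.
m = B @ beta0
K = B @ Sigma_beta @ B.T

# Monte Carlo check: simulate coefficients, form mu = B beta, compare moments.
rng = np.random.default_rng(1)
chol = np.linalg.cholesky(Sigma_beta)
beta_draws = beta0 + rng.standard_normal((200_000, H)) @ chol.T
mu_draws = beta_draws @ B.T
m_mc = mu_draws.mean(axis=0)
K_mc = np.cov(mu_draws, rowvar=False)
```

With this many simulations, `m_mc` and `K_mc` should agree with `m` and `K` up to Monte Carlo error.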


Covariance functions 


Different covariance functions can be used to add structural prior assumptions like 
smoothness, nonstationarity, periodicity, and multiscale or hierarchical structures. Sums of 
independent Gaussian processes are also Gaussian processes, and sums and products of 
covariance functions are again valid covariance functions, which allows easy combination of 
different covariance functions. Linear models can also be presented as Gaussian processes 
with a dot product covariance function. Although presenting a linear model as a Gaussian 
process is typically not the most computationally efficient approach, it makes it easy to extend 
hierarchical generalized linear models to include nonlinear effects and implicit interactions. 
1In some of the literature, the parameterization is in terms of a = z 

When there is more than one predictor and the focus is on a multivariate regression 
surface, it is typically not ideal to use a single parameter $l$ to control the smoothness of 
$\mu$ in all directions. Such isotropic Gaussian processes do not do a good job at efficiently 
characterizing regression surfaces in which the mean of the response changes more rapidly in 
certain predictors than others. In addition, it may be possible to get just as good predictive 
performance using only a subset of predictors in the regression function. In such settings, 
anisotropic Gaussian processes may be preferred and one can use, for example, a modified 
squared exponential covariance function, 

$$k(x, x') = \mathrm{cov}(\mu(x), \mu(x')) = \tau^2 \exp\left(-\sum_{j=1}^{p} \frac{(x_j - x'_j)^2}{2l_j^2}\right),$$

where $l_j$ is a length scale parameter controlling smoothness in the direction of the $j$th 
predictor. One can do nonparametric variable selection by choosing hyperpriors for these 
$l_j$'s to allow data adaptivity to the true anisotropic smoothness levels, so that predictors 
that are not needed drop out with large values for the corresponding $l_j$'s. 
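The way a large length scale removes a predictor can be seen directly in a small sketch. The function name `ard_sq_exp_cov`, the simulated inputs, and the length scale values are assumptions for illustration only. Setting $l_2$ very large should give essentially the same covariance matrix as using the first predictor alone.

```python
import numpy as np

def ard_sq_exp_cov(X1, X2, tau, lengths):
    """tau^2 exp(-sum_j (x_j - x'_j)^2 / (2 l_j^2)) for inputs of shape (n, p)."""
    lengths = np.asarray(lengths, dtype=float)
    Z1 = np.asarray(X1, dtype=float) / lengths   # rescale each predictor by its l_j
    Z2 = np.asarray(X2, dtype=float) / lengths
    sq_dist = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(axis=-1)
    return tau**2 * np.exp(-0.5 * sq_dist)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(6, 2))          # six made-up points, two predictors

K_both = ard_sq_exp_cov(X, X, tau=1.0, lengths=[0.5, 0.5])
# A huge length scale for predictor 2 makes the covariance ignore it.
K_drop2 = ard_sq_exp_cov(X, X, tau=1.0, lengths=[0.5, 1e6])
K_only1 = ard_sq_exp_cov(X[:, :1], X[:, :1], tau=1.0, lengths=[0.5])
```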


Inference 


Given a Gaussian observation model, $y_i \sim \mathrm{N}(\mu_i, \sigma^2)$, $i = 1,\dots,n$, Gaussian process priors 
are appealing in being conditionally conjugate given $\tau, l, \sigma$, so that the conditional posterior 
for $\mu$ given $(x_i, y_i)_{i=1}^{n}$ is again a Gaussian process but with updated mean and covariance. 
In practice, one cannot estimate $\mu$ at infinitely many locations and hence the focus is on 
the realizations at the data points $x = (x_1,\dots,x_n)$ and any additional locations $\tilde{x}$ at which 
predictions are of interest. Given a Gaussian process prior $\mathrm{GP}(0, k)$, the joint density for 
observations $y$ and $\mu$ at additional locations $\tilde{x}$ is simply a multivariate Gaussian 

$$\begin{pmatrix} y \\ \tilde{\mu} \end{pmatrix} \sim \mathrm{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} K(x,x) + \sigma^2 I & K(x,\tilde{x}) \\ K(\tilde{x},x) & K(\tilde{x},\tilde{x}) \end{pmatrix} \right),$$

where noise variance $\sigma^2$ has been added to the diagonal of the covariance of $\mu$ to get the 
covariance for $y$. The conditional posterior of $\tilde{\mu}$ given $\tau, l, \sigma$ and the data is obtained 
from the properties of the multivariate Gaussian (recall the presentation of the multivariate 
normal model with known variance in Section 3.5). For zero prior mean the posterior for $\tilde{\mu}$ 
at a new value $\tilde{x}$ not in the original dataset $x$ is 

$$\begin{aligned}
\tilde{\mu} \mid x, y, \tau, l, \sigma &\sim \mathrm{N}(E(\tilde{\mu}), \mathrm{cov}(\tilde{\mu})) \\
E(\tilde{\mu}) &= K(\tilde{x},x)(K(x,x) + \sigma^2 I)^{-1} y \\
\mathrm{cov}(\tilde{\mu}) &= K(\tilde{x},\tilde{x}) - K(\tilde{x},x)(K(x,x) + \sigma^2 I)^{-1} K(x,\tilde{x}).
\end{aligned}$$

Figure 21.2 shows draws of $\mu$ from the posterior distribution of a Gaussian process, assuming 
the same Gaussian process priors as in Figure 21.1 and $\sigma = 0.1$. 
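The posterior mean and covariance formulas can be sketched in a few lines. This is a minimal illustration assuming a zero prior mean, a one-dimensional predictor, fixed hyperparameters, and made-up data. The helper names are hypothetical, and a Cholesky factorization is used in place of an explicit matrix inverse.

```python
import numpy as np

def sq_exp_cov(x1, x2, tau, length):
    """Squared exponential covariance for 1-D inputs."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return tau**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / length**2)

def gp_posterior(x, y, x_new, tau, length, sigma):
    """Posterior mean and covariance of mu at x_new, given tau, l, sigma."""
    K_y = sq_exp_cov(x, x, tau, length) + sigma**2 * np.eye(len(x))  # K(x,x) + sigma^2 I
    K_sx = sq_exp_cov(x_new, x, tau, length)                         # K(x~, x)
    K_ss = sq_exp_cov(x_new, x_new, tau, length)                     # K(x~, x~)
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))              # (K + sigma^2 I)^{-1} y
    mean = K_sx @ alpha
    V = np.linalg.solve(L, K_sx.T)
    cov = K_ss - V.T @ V
    return mean, cov

x = np.linspace(0.0, 9.0, 10)
y = np.sin(x)                          # synthetic data for illustration only
x_new = np.linspace(0.0, 9.0, 50)
mean, cov = gp_posterior(x, y, x_new, tau=1.0, length=1.0, sigma=0.1)
```

The square roots of the diagonal of `cov` give pointwise posterior standard deviations for $\tilde{\mu}$, conditional on the chosen hyperparameters.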

It may seem that posterior computation in Gaussian process regression is a trivial matter, 
but there are two main hurdles involved. The first is that computation of the mean and 
covariance in the $n$-variate normal conditional posterior distribution for $\tilde{\mu}$ involves matrix 
inversion that requires $O(n^3)$ computation. This computation needs to be repeated, for 
example, at each MCMC step with changing hyperparameters, and hence the computational 
expense increases so rapidly with $n$ that it becomes challenging to fit Gaussian process 
regression models when $n$ is greater than a few thousand and the number of predictors is 
greater than 3 or the number of hyperparameters is greater than 15 or so. 


Covariance function approximations 


There is a substantial literature on approximations to the Gaussian process that speed 
computation by reducing the matrix inversion burden. Some Gaussian processes can be 
represented as Markov random fields. When there are three or fewer predictors, Markov 
random fields can be computed efficiently by exploiting conditional independence to produce 
a sparse precision matrix. In the univariate case the computation can be made in time $O(n)$ 
using sequential inference, and computation is no problem even for $n$ greater than a million. 
For spatial problems the computation can be made in time $O(n^{3/2})$ with specific algorithms. 
In low-dimensional cases it is also possible to approximate Gaussian processes with basis 
function approximations where the number of basis functions $m$ is much smaller than $n$. As 
the number of dimensions increases, the choice of basis functions in the data space becomes 
more difficult (see Section 20.3). 

Figure 21.2 Posterior draws of a Gaussian process $\mu(x)$ fit to ten data points, conditional on three 
different choices of the parameters $\tau, l$ that characterize the process. Compare to Figure 21.1, which 
shows draws of the curve from the prior distribution of each model. In our usual analysis, we would 
assign a prior distribution to $\tau, l$ and then perform joint posterior inference for these parameters 
along with the curve $\mu(x)$; see Figure 21.3. We show these three choices of conditional posterior 
distribution here to give a sense of the role of $\tau, l$ in posterior inference. 

The above-mentioned approximations can be used when the number of predictors is 
large, if the latent function is modeled as additive, that is, as a sum of low-dimensional 
approximations (see Section 20.3). Interactions of 2 or 3 predictors can be included in each 
additive component, but models with implicit interactions between arbitrary predictors 
cannot be easily computed. 

If there are rapid changes in the function, the length scale of the dependency is rela- 
tively short. Then sparse covariance matrices can be obtained by using compact support 
covariance functions, potentially reducing greatly the time needed for the inversion. The 
covariance matrix can be sparse even if there are tens of predictors. 

If the function is smooth, the length scale of the dependency is relatively long. Then 
reduced rank approximations of the covariance matrices can be obtained in many different 
ways, reducing the time needed for the inversion to $O(mn^2)$, where $m < n$. Reduced rank 
approximations can be used for high-dimensional data, although it may be more difficult 
to set them up so that approximation error is small everywhere. Different covariance 
function approximations can also be combined in additive models, for example, by combining 
different approximations for short and long length scale dependencies. 


Marginal likelihood and posterior 


If the data model is Gaussian, we can integrate over $\mu$ analytically to get the log marginal 
likelihood for covariance function parameters $\tau$ and $l$ and residual variance $\sigma^2$: 

$$\log p(y \mid \tau, l, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log\left|K(x,x) + \sigma^2 I\right| - \frac{1}{2}\, y^T (K(x,x) + \sigma^2 I)^{-1} y. \qquad (21.1)$$


The marginal likelihood is combined with the prior to get the unnormalized marginal 
posterior, and inference can proceed with methods described in Chapters 10-13. Figure 21.3 
shows an example of estimated marginal posterior distributions for $\tau$, $l$, and $\sigma$, and the 
posterior mean and pointwise 90% bands for $\mu(x)$ using the same data as in Figure 21.2. 
We obtained the posterior simulations using slice sampling. The priors were tf (0, 1) for $\tau$ 
and $l$, and log-uniform for $\sigma$. 

Figure 21.3 Marginal posterior distributions for Gaussian process parameters $\tau$, $l$ and error scale 
$\sigma$, and posterior mean and pointwise 90% bands for $\mu(x)$, given the same ten data points from 
Figure 21.2. 
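The log marginal likelihood (21.1) is straightforward to evaluate with a Cholesky factorization. The sketch below is one possible implementation under assumed names and synthetic data. It omits the prior on the hyperparameters, so a log prior would need to be added to obtain the unnormalized log marginal posterior.

```python
import numpy as np

def sq_exp_cov(x1, x2, tau, length):
    """Squared exponential covariance for 1-D inputs."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return tau**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / length**2)

def log_marginal_likelihood(x, y, tau, length, sigma):
    """log p(y | tau, l, sigma^2) for a zero-mean GP with Gaussian noise."""
    n = len(y)
    K_y = sq_exp_cov(x, x, tau, length) + sigma**2 * np.eye(n)
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K_y^{-1} y
    log_det = 2.0 * np.log(np.diag(L)).sum()              # log |K_y|
    return -0.5 * n * np.log(2.0 * np.pi) - 0.5 * log_det - 0.5 * y @ alpha

# Synthetic data and an arbitrary grid of length scales, for illustration only.
x = np.linspace(0.0, 9.0, 10)
y = np.sin(x)
lml_by_length = {length: log_marginal_likelihood(x, y, tau=1.0, length=length, sigma=0.1)
                 for length in (0.3, 1.0, 3.0)}
```

In an MCMC or optimization routine this function would be called repeatedly with changing hyperparameters, which is where the $O(n^3)$ cost of the factorization becomes the bottleneck.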


21.2 Example: birthdays and birthdates 


Gaussian processes can be directly fit to data, but more generally they can be used as com- 
ponents in a larger model. We illustrate with an analysis of patterns in birthday frequencies 
in a dataset containing records of all births in the United States on each day during the 
years 1969-1988. We originally read about these data being used to uncover a pattern of 
fewer births on Halloween and excess births on Valentine’s Day (due, presumably, to choices 
involved in scheduled deliveries, along with decisions of whether to induce a birth for health 
reasons). We thought it would be instructive to fit a model to look not just at special days 
but also at day-of-week effects, patterns during the year, and longer-term trends. 


Decomposing the time series as a sum of Gaussian processes 


Based on structural knowledge of the calendar, we started with an additive model, 


$$y_t(t) = f_1(t) + f_2(t) + f_3(t) + f_4(t) + f_5(t) + \epsilon_t,$$


where t is time in days (starting with t = 1 on 1 January 1969), and the different terms 
represent variation with different scales and periodicity: 


1. Long-term trends modeled by a Gaussian process with squared exponential covariance 
function: 

$$f_1(t) \sim \mathrm{GP}(0, k_1), \quad k_1(t,t') = \sigma_1^2 \exp\left(-\frac{|t - t'|^2}{2l_1^2}\right).$$

2. Shorter term variation using a GP with squared exponential covariance function with 
different amplitude and scale: 

$$f_2(t) \sim \mathrm{GP}(0, k_2), \quad k_2(t,t') = \sigma_2^2 \exp\left(-\frac{|t - t'|^2}{2l_2^2}\right).$$

3. Weekly quasi-periodic pattern (that is allowed to change over time) modeled as a product 
of periodic and squared exponential covariance function: 

$$f_3(t) \sim \mathrm{GP}(0, k_3), \quad k_3(t,t') = \sigma_3^2 \exp\left(-\frac{2\sin^2(\pi(t - t')/7)}{l_{3,1}^2}\right) \exp\left(-\frac{|t - t'|^2}{2l_{3,2}^2}\right).$$

4. Yearly smooth seasonal pattern using product of periodic and squared exponential 
covariance function (with period 365.25 to match the average length of the year): 

$$f_4(t) \sim \mathrm{GP}(0, k_4), \quad k_4(s,s') = \sigma_4^2 \exp\left(-\frac{2\sin^2(\pi(s - s')/365.25)}{l_{4,1}^2}\right) \exp\left(-\frac{|s - s'|^2}{2l_{4,2}^2}\right),$$

where $s = s(t) = t \bmod 365.25$, thus aligning itself with the calendar every four years. 

5. Special days including an interaction term with weekend. Based on a combination of 
initial visual inspection and prior knowledge we chose the following special days: New 
Year's Day, Valentine's Day, Leap Day, April Fool's Day, Independence Day, Halloween, 
Christmas, and the days between Christmas and New Year's. 

$$f_5(t) = I_{\text{special day}}(t)\beta_a + I_{\text{weekend}}(t)\, I_{\text{special day}}(t)\beta_b,$$

where $I_{\text{special day}}(t)$ is a row vector of 13 indicator variables corresponding to each of 
the special days (we can think of this vector as one row of an $n \times 13$ indicator matrix 
$I_{\text{special day}}$); $I_{\text{weekend}}(t)$ is an indicator variable that equals 1 if $t$ is a Saturday or Sunday, 
and 0 otherwise; and $\beta_a$ and $\beta_b$ are vectors, each of length 13, corresponding to the effects 
of special days when they fall on weekdays or weekends. 

6. Finally, $\epsilon_t \sim \mathrm{N}(0, \sigma^2)$ represents the unstructured residuals. 


We set weakly informative log-t priors for the time-scale parameters l (to improve iden- 
tifiability of the model) and log-uniform priors for all the other hyperparameters. We 
normalized the number of daily births y to have mean 0 and standard deviation 1. 

The sum of Gaussian processes is also a Gaussian process, and the covariance function 
for the sum is 


$$k(t,t') = k_1(t,t') + k_2(t,t') + k_3(t,t') + k_4(t,t') + k_5(t,t').$$


The inference for the model is then straightforward with basic Gaussian process equations. 
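To make the additive structure concrete, here is a small sketch that builds three of the component covariance matrices (slow trend, faster variation, and the weekly quasi-periodic term) and adds them. The helper names and all hyperparameter values are invented for illustration and are not the fitted values from this analysis. The yearly and special-day terms are omitted for brevity.

```python
import numpy as np

def sq_exp(t1, t2, sigma, length):
    """sigma^2 exp(-|t - t'|^2 / (2 l^2))."""
    d = t1[:, None] - t2[None, :]
    return sigma**2 * np.exp(-0.5 * d**2 / length**2)

def periodic(t1, t2, period, length):
    """exp(-2 sin^2(pi (t - t') / period) / l^2)."""
    d = t1[:, None] - t2[None, :]
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length**2)

t = np.arange(1.0, 141.0)   # 20 weeks of daily time points

# Hypothetical hyperparameters, chosen only to show the construction.
K1 = sq_exp(t, t, sigma=1.0, length=500.0)                 # long-term trend
K2 = sq_exp(t, t, sigma=0.3, length=20.0)                  # shorter-term variation
K3 = sq_exp(t, t, sigma=0.5, length=1000.0) * periodic(t, t, period=7.0, length=1.0)  # weekly

# The covariance of the sum of independent components is the sum of covariances.
K = K1 + K2 + K3
```

Under these assumed values, two days exactly one week apart are much more strongly correlated through `K3` than two days three days apart, which is the quasi-periodic behavior the weekly component is meant to capture.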

We analytically determined the marginal likelihood and its gradients for hyperparameters 
as in (21.1), and we used the marginal posterior mode for the hyperparameters. As 
$n$ was relatively high (corresponding to all the days during a twenty-year period, that is 
$n \approx 20 \cdot 365.25$), this posterior mode was fine in practice. Central composite design (CCD) 
integration gave visually indistinguishable plots, and MCMC would have been too slow. 
The Gaussian process formulation with $O(n^3)$ computation time is not optimal for this 
kind of one-dimensional data, but computation time was still reasonable. 

Figure 21.4 shows the slow trend, faster non-periodic correlated variation, weekly trend 
and its change through years, seasonal effect and its change through years, and day of year 
effects. All plots are on the same scale showing differences relative to a baseline of 100. 
Predictions for different additive components can be computed with the usual posterior 
equation (21.1) but using only one of the covariance functions to compute the covariance 
between training and the test data. For example, the mean of the slow trend is computed 
as 


$$E(\tilde{f}_1) = K_1(\tilde{x}, x)(K(x,x) + \sigma^2 I)^{-1} y. \qquad (21.2)$$
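The component-wise prediction in (21.2) can be illustrated on a toy series. In this sketch the data are synthetic, there are only two components (a slow trend and a strictly periodic weekly term), and the hyperparameters are fixed at arbitrary assumed values. The point is only that each component mean uses its own covariance in front of the shared term $(K + \sigma^2 I)^{-1} y$.

```python
import numpy as np

def sq_exp(t1, t2, sigma, length):
    d = t1[:, None] - t2[None, :]
    return sigma**2 * np.exp(-0.5 * d**2 / length**2)

def periodic(t1, t2, period, length):
    d = t1[:, None] - t2[None, :]
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length**2)

rng = np.random.default_rng(2)
t = np.arange(1.0, 71.0)
# Synthetic series: linear trend plus a weekly cycle plus noise, then standardized.
y = 0.02 * t + np.sin(2.0 * np.pi * t / 7.0) + 0.1 * rng.standard_normal(t.size)
y = (y - y.mean()) / y.std()

sigma = 0.3                                          # assumed residual scale
K_slow = sq_exp(t, t, sigma=1.0, length=50.0)        # assumed trend covariance
K_week = 0.5**2 * periodic(t, t, period=7.0, length=1.0)  # assumed weekly covariance
K_y = K_slow + K_week + sigma**2 * np.eye(t.size)    # covariance of the observations

alpha = np.linalg.solve(K_y, y)                      # (K + sigma^2 I)^{-1} y
mean_slow = K_slow @ alpha                           # posterior mean of the slow trend
mean_week = K_week @ alpha                           # posterior mean of the weekly term
mean_total = (K_slow + K_week) @ alpha               # posterior mean of the sum
```

Because the formula is linear in the covariance, the component means add up to the posterior mean of the full latent function.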


The smooth seasonal effect has an inverse relation to the amount of daylight or the average 
temperature nine months before. The smaller number of births on weekends and the smaller or 
larger number of births on special days can be explained by elective c-sections and induced 
births. The day-of-week patterns become more pronounced over time, which makes sense 
given the general increasing rate of these sorts of births. 

Figure 21.4 Relative number of births in the United States based on exact data from each day 
from 1969 through 1988, divided into different components, each with an additive Gaussian process 
model. The estimates from an improved model are shown in Figure 21.5. 


An improved model 


Statistical models are not built at once. Rather, we fit a model, notice problems, and 
improve it. In this case, selecting just some special days makes it impossible to discover 
other days having a considerable effect. Also we might expect to see a ‘ringing’ pattern 
with a distortion of births just before and after the special days (as the babies have to be 
born sometime). 


To allow for these sorts of structures, we constructed a new model that allowed special 
effects for each day of the year. While analyzing the first model we also noticed that the 
residuals were slightly autocorrelated, so we added a very short time-scale non-periodic 
component to explain that. To improve yearly periodic components we also refined the 
handling of the leap day. Our improved model has the form 

$$y_t(t) = f_1(t) + f_2(t) + f_3(t) + f_4(t) + f_5(t) + f_6(t) + f_7(t) + f_8(t) + \epsilon_t.$$


1. Long-term trends modeled by a Gaussian process with squared exponential covariance 
function: 

$$f_1(t) \sim \mathrm{GP}(0, k_1), \quad k_1(t,t') = \sigma_1^2 \exp\left(-\frac{|t - t'|^2}{2l_1^2}\right).$$

2. Shorter term variation using a GP with squared exponential covariance function with 
different amplitude and scale: 

$$f_2(t) \sim \mathrm{GP}(0, k_2), \quad k_2(t,t') = \sigma_2^2 \exp\left(-\frac{|t - t'|^2}{2l_2^2}\right).$$

3. Weekly quasi-periodic pattern (that is allowed to change over time) modeled as a product 
of periodic and squared exponential covariance function: 

$$f_3(t) \sim \mathrm{GP}(0, k_3), \quad k_3(t,t') = \sigma_3^2 \exp\left(-\frac{2\sin^2(\pi(t - t')/7)}{l_{3,1}^2}\right) \exp\left(-\frac{|t - t'|^2}{2l_{3,2}^2}\right).$$

4. Yearly smooth seasonal pattern using product of periodic and squared exponential 
covariance function: 

$$f_4(t) \sim \mathrm{GP}(0, k_4), \quad k_4(t,t') = \sigma_4^2 \exp\left(-\frac{2\sin^2(\pi(s - s')/365)}{l_{4,1}^2}\right) \exp\left(-\frac{|s - s'|^2}{2l_{4,2}^2}\right),$$

where $s = s(t)$ is now a modified time with time before and after leap day incremented by 0.5 
day so that in $s$ the length of year is 365 also for leap years (which makes yearly periodicity 
easier to implement). 

5. Yearly fast changing pattern for weekdays (day-of-year effect) using a periodic covariance 
function: 

$$f_5(t) \sim \mathrm{GP}(0, k_5), \quad k_5(t,t') = I_{\text{weekday}}(t,t')\, \sigma_5^2 \exp\left(-\frac{2\sin^2(\pi(s - s')/365)}{l_5^2}\right),$$

where $I_{\text{weekday}}(t,t')$ is an indicator variable that equals 1 if both $t$ and $t'$ are weekdays, 
and 0 otherwise. 

6. A similar pattern for weekends: 

$$f_6(t) \sim \mathrm{GP}(0, k_6), \quad k_6(t,t') = I_{\text{weekend}}(t,t')\, \sigma_6^2 \exp\left(-\frac{2\sin^2(\pi(s - s')/365)}{l_6^2}\right),$$

where $I_{\text{weekend}}(t,t')$ is an indicator variable that equals 1 if both $t$ and $t'$ are Saturday 
or Sunday, and 0 otherwise. 

7. Effects of special days whose dates are not constant from year to year (Leap Day, 
Memorial Day, Labor Day, Thanksgiving): 

$$f_7(t) = I_{\text{special day}}(t)\beta,$$

where $I_{\text{special day}}(t)$ is now a row vector of 4 indicator variables corresponding to these 
floating holidays. 

8. Short-term variation using a Gaussian process with squared exponential covariance 
function: 

$$f_8(t) \sim \mathrm{GP}(0, k_8), \quad k_8(t,t') = \sigma_8^2 \exp\left(-\frac{|t - t'|^2}{2l_8^2}\right).$$

9. Finally, $\epsilon_t \sim \mathrm{N}(0, \sigma^2)$ models the unstructured residual. 

Figure 21.5 Relative number of births in the United States based on exact data from each day 
from 1969 through 1988, divided into different components, each with an additive Gaussian process 
model. Compared to Figure 21.4, this improved model allows individual effects for every day of the 
year, not merely for a few selected dates. 


We set weakly informative log-t priors for time-scale parameters l (to improve identifiability 
of the model) and log-uniform prior for all the other hyperparameters. The number of births 
y was normalized to have mean 0 and standard deviation 1. 

Exploiting properties of the multivariate Gaussian, leave-one-out cross-validation pointwise 
predictions can be computed in similar time as posterior predictions. The cross-validated 
pointwise predictive accuracy is $\mathrm{lppd}_{\text{loo-cv}} = 2074$ for the first model and 2477 for the 
improved model, showing clear improvement. 
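One standard way to exploit the multivariate Gaussian structure, conditional on fixed hyperparameters and a zero prior mean, is through the inverse of $K_y = K(x,x) + \sigma^2 I$: the leave-one-out predictive variance for $y_i$ is $1/[K_y^{-1}]_{ii}$ and the predictive mean is $y_i - [K_y^{-1} y]_i / [K_y^{-1}]_{ii}$. The sketch below applies these identities to made-up data. It is an illustration of the idea and not necessarily the exact computation used for the numbers reported here.

```python
import numpy as np

def gp_loo(K_y, y):
    """Closed-form leave-one-out predictive means, variances and log densities.

    K_y is K(x, x) + sigma^2 I; hyperparameters are treated as fixed.
    """
    K_inv = np.linalg.inv(K_y)
    alpha = K_inv @ y
    loo_var = 1.0 / np.diag(K_inv)
    loo_mean = y - alpha * loo_var
    log_dens = -0.5 * np.log(2.0 * np.pi * loo_var) - 0.5 * (y - loo_mean) ** 2 / loo_var
    return loo_mean, loo_var, log_dens

# Synthetic example with assumed hyperparameters (tau = 1, l = 1, sigma = 0.2).
x = np.linspace(0.0, 5.0, 12)
K_y = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 0.2**2 * np.eye(x.size)
y = np.cos(x)

loo_mean, loo_var, log_dens = gp_loo(K_y, y)
lppd_loo = log_dens.sum()   # sum of pointwise log predictive densities
```

A single matrix inverse yields all $n$ leave-one-out predictions, which is why the cost is comparable to that of ordinary posterior prediction.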

Figure 21.5 shows the results for the second model. The trends and day of week effect 
are indistinguishable from the first model, but the seasonal component is smoother as it 
does not need to model the increased number of births before or after special days and 
before the end of the year, which are now modeled in the day of year component. This new 
model is not perfect either (for one thing, it would make sense to constrain local positive 
and negative effects to average approximately to zero so that extra babies are explicitly 
‘borrowed’ from neighboring days), but we believe that the decomposition shown in Figure 
21.5 does a good job of identifying the major patterns at different time scales. The trick 
was to use Gaussian processes to allow different scales of variation for different components 
of the model. This example also illustrates how we are able to keep adding terms to the 
additive model without losing control of the estimation. 


21.3 Latent Gaussian process models 


In case of non-Gaussian likelihoods, the Gaussian process prior is set to a latent function 
$f$ which through a link function determines the likelihood $p(y \mid f, \phi)$ as in generalized linear 
models (see Chapter 16). Typically the shape parameter $\phi$ is assumed to be a scalar, but 
it is also possible to use separate latent Gaussian processes to model location $f$ and shape 
parameter $\phi$ of the likelihood to allow, for example, the scale to depend on the predictors. 

The conditional posterior density of the latent $f$ is $p(f \mid x, y, \theta, \phi) \propto p(y \mid f, \phi)\, p(f \mid x, \theta)$. 
For efficient MCMC inference, GP-specific samplers can be used. These samplers exploit the 
multivariate Gaussian form of the prior for the latent values in the proposal distribution or 
in the scaling of the latent variables. The most commonly used samplers are the elliptical slice 
sampler, scaled Metropolis-Hastings, and scaled HMC/NUTS (these are Gaussian-process-specific 
variations of the samplers discussed in Chapters 11 and 12). Typically the sampling 
is done alternating the sampling of latent values $f$ and covariance and likelihood parameters 
$\theta$ and $\phi$. Due to dependency between the latent values and the (hyper)parameters, mixing 
of the MCMC can be slow, creating difficulties when fitting to larger datasets. 
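As an illustration of a sampler that exploits the Gaussian prior on the latent values, here is a minimal sketch of one elliptical slice sampling update for $f$ with the hyperparameters held fixed. The function name and the toy check (a one-dimensional latent value with a Gaussian likelihood, where the exact posterior is N(0.5, 0.5)) are assumptions for this example only.

```python
import numpy as np

def elliptical_slice_step(f, log_lik, chol_K, rng):
    """One elliptical slice update for latent values f with prior N(0, K)."""
    nu = chol_K @ rng.standard_normal(f.size)          # auxiliary draw from the prior
    log_threshold = log_lik(f) + np.log(rng.uniform()) # slice level under the likelihood
    theta = rng.uniform(0.0, 2.0 * np.pi)
    theta_min, theta_max = theta - 2.0 * np.pi, theta
    while True:
        f_prop = f * np.cos(theta) + nu * np.sin(theta)  # point on the ellipse through f and nu
        if log_lik(f_prop) > log_threshold:
            return f_prop
        # Shrink the bracket toward theta = 0, which corresponds to the current state.
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)

# Toy check: prior f ~ N(0, 1), one observation y = 1 with unit noise variance.
y_obs = 1.0

def log_lik(f):
    return -0.5 * (y_obs - f[0]) ** 2

rng = np.random.default_rng(3)
chol_K = np.array([[1.0]])
f = np.zeros(1)
samples = np.empty(5000)
for s in range(samples.size):
    f = elliptical_slice_step(f, log_lik, chol_K, rng)
    samples[s] = f[0]
```

The update has no tuning parameters, and every iteration returns a new state. In a full analysis it would be alternated with updates of the covariance and likelihood parameters.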

As the prior distribution for latent values is multivariate Gaussian, the posterior 
distribution of the latent values is also often close to Gaussian; this motivates Gaussian posterior 
approximations. The simplest approach is to use the normal approximation (Chapter 4) 

$$p(f \mid x, y, \theta, \phi) \approx \mathrm{N}(f \mid \hat{f}, \Sigma),$$

where $\hat{f}$ is the posterior mode and 

$$\Sigma^{-1} = K(x,x)^{-1} + W,$$

where $K(x,x)$ is the prior covariance matrix and $W$ is a diagonal matrix with 
$W_{ii} = -\frac{d^2}{df_i^2} \log p(y_i \mid f_i, \phi)\big|_{f_i = \hat{f}_i}$. The approximate predictive density 
can be evaluated, for example, with quadrature integration. Log marginal likelihood can 
be approximated by integrating over $f$ using Laplace's method (Section 13.3) 

$$\log p(y \mid x, \theta, \phi) \approx \log q(y \mid x, \theta, \phi) \propto \log p(y \mid \hat{f}, \phi) - \frac{1}{2}\hat{f}^T K(x,x)^{-1} \hat{f} - \frac{1}{2}\log|B|, \qquad (21.3)$$

where $|B| = |I + W^{1/2} K(x,x) W^{1/2}|$. 
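The normal approximation and the Laplace estimate of the log marginal likelihood can be sketched for a simple case. The code below assumes a Bernoulli likelihood with logit link (so there is no shape parameter $\phi$), fixed covariance parameters, synthetic data, and plain Newton iterations without step-size safeguards. The function name is hypothetical, and the arithmetic follows the usual numerically stable formulation based on $B = I + W^{1/2} K W^{1/2}$.

```python
import numpy as np

def laplace_gp_logit(K, y, max_iter=100, tol=1e-10):
    """Posterior mode, W, and Laplace log marginal likelihood for binary y."""
    n = len(y)
    f = np.zeros(n)
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-f))
        W = pi * (1.0 - pi)                      # negative second derivatives of log lik
        sW = np.sqrt(W)
        L = np.linalg.cholesky(np.eye(n) + sW[:, None] * K * sW[None, :])
        b = W * f + (y - pi)                     # W f + gradient of the log likelihood
        a = b - sW * np.linalg.solve(L.T, np.linalg.solve(L, sW * (K @ b)))
        f_new = K @ a                            # Newton update; a equals K^{-1} f_new
        step = np.max(np.abs(f_new - f))
        f = f_new
        if step < tol:
            break
    # Recompute W and the Cholesky factor of B at the final mode.
    pi = 1.0 / (1.0 + np.exp(-f))
    W = pi * (1.0 - pi)
    sW = np.sqrt(W)
    L = np.linalg.cholesky(np.eye(n) + sW[:, None] * K * sW[None, :])
    log_lik = np.sum(y * f - np.logaddexp(0.0, f))
    # log p(y | f_hat) - 0.5 f_hat^T K^{-1} f_hat - 0.5 log|B|
    log_q = log_lik - 0.5 * a @ f - np.log(np.diag(L)).sum()
    return f, a, W, log_q

# Synthetic binary data and an assumed squared exponential covariance.
x = np.linspace(-2.0, 2.0, 12)
K = 1.5**2 * np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 1e-6 * np.eye(12)
y = (x > 0).astype(float)
y[2], y[9] = 1.0, 0.0
f_hat, a, W, log_q = laplace_gp_logit(K, y)
```

At the mode the gradient of the log posterior vanishes, so `a` should equal the gradient of the log likelihood, `y - 1 / (1 + exp(-f_hat))`, which gives a simple convergence check.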

If the likelihood contribution is heavily skewed, as can be the case with the logistic 
model, expectation propagation (Section 13.8) can be used instead. Variational 
approximation (Section 13.7) has also been used, but in many cases it is slower than the normal 
approximation and not as accurate as EP. Using one of these analytic approximations for 
the latent posterior, the approximate (unnormalized) marginal posterior of the 
hyperparameters $q(\theta, \phi \mid x, y)$ and its gradients can be computed analytically, allowing one to 
efficiently find the posterior mode of the hyperparameters (Section 13.1) or to integrate over the 
hyperparameters using MCMC (Chapter 12), integration over a grid (Section 10.3), or CCD 
integration (Section 13.9). 


Example. Leukemia survival times 

To illustrate a Gaussian process model with non-Gaussian data, we analyze survival 
in acute myeloid leukemia (AML) in adults. The example illustrates the flexibility of 
Gaussian process for modeling nonlinear effects and implicit interactions. 

As data we have survival times $t$ and censoring indicator $z$ (0 for observed and 1 for 
censored) for 1043 cases recorded between 1982 and 1998 in the North West Leukemia 
Register in the United Kingdom. Some 16% of cases were censored. Predictors are 
age, sex, white blood cell count (WBC) at diagnosis with 1 unit = $50 \times 10^9$/L, and 
the Townsend score which is a measure of deprivation for district of residence. 

As the WBC measurements were strictly positive and highly skewed, we fit the model 
to its logarithm. In theory as $n \to \infty$ a Gaussian process model could be fit with an 
estimated link function so as to learn this sort of mapping directly from the data, but 
with finite n it can be helpful to preprocess the data using sensible transformations. 
Continuous predictors were normalized to have zero mean and unit standard deviation. 
Survival time was normalized to have zero mean for the logarithm of time (this way 
the constant term will be smaller and computation is more stable). 

We tried several models including proportional hazard, Weibull, and log-Gaussian, but 
the log-logistic gave the best results. The log-logistic model can be considered as a 
more robust choice compared to log-Gaussian which assumes a Gaussian observation 
model for the logarithm of the survival times. As we do not have a model for the 
censoring process, we do not have a full observation model. The likelihood for the 
log-logistic model is, 


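$$p(t \mid r, f) = \prod_{i=1}^{n} \left(\frac{r\, t_i^{\,r-1}}{\exp(f_i(X_i))^{r}}\right)^{1-z_i} \left(1 + \left(\frac{t_i}{\exp(f_i(X_i))}\right)^{r}\right)^{z_i-2},$$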


where $r$ is the shape parameter and $z_i$ the censoring indicators (see Section 8.7 for 
more about censored observations). We center the Gaussian process on a linear model 
to get a latent model, 

$$f_i(X_i) = \alpha + X_i\beta + \mu(X_i),$$

where $\mu \sim \mathrm{GP}(0, k)$ with squared exponential covariance function, 

$$k(x, x') = \mathrm{cov}(\mu(x), \mu(x')) = \sigma_\mu^2 \exp\left(-\frac{1}{2}\sum_{j=1}^{p} \frac{(x_j - x'_j)^2}{l_j^2}\right).$$


We set weakly informative priors: uniform on $\log r$, normal priors $\alpha \sim \mathrm{N}(0, \sigma_\alpha^2)$, 
$\beta \sim \mathrm{N}(0, \sigma_\beta^2)$, independent Inv-$\chi^2(1, 1)$ priors on $\sigma_\alpha^2$, $\sigma_\beta^2$, $\sigma_\mu^2$, and $l_j \sim \mathrm{Cauchy}(0, 1)$. 
We used Laplace's method for the latent values and CCD integration for the 
hyperparameters. The expectation propagation method produced similar results but was an order 
of magnitude slower. 
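For readers who want to experiment, the censored log-logistic log likelihood can be coded directly. The sketch below assumes the standard log-logistic parameterization with scale $\exp(f_i)$ and shape $r$: an observed time contributes the log density, and a censored time contributes the log survival function $-\log(1 + (t_i/\exp(f_i))^r)$. The function name and argument layout are hypothetical.

```python
import numpy as np

def loglogistic_loglik(t, z, f, r):
    """Censored log-logistic log likelihood.

    t: positive survival times; z: 1 if censored, 0 if observed;
    f: latent values (log of the scale parameter); r: shape parameter.
    """
    t = np.asarray(t, dtype=float)
    z = np.asarray(z, dtype=float)
    f = np.asarray(f, dtype=float)
    log_odds = r * (np.log(t) - f)                    # log (t / exp(f))^r
    log_one_plus = np.logaddexp(0.0, log_odds)        # log(1 + (t / exp(f))^r)
    log_numer = np.log(r) + (r - 1.0) * np.log(t) - r * f
    # Observed: log_numer - 2 log_one_plus.  Censored: -log_one_plus.
    return np.sum((1.0 - z) * log_numer + (z - 2.0) * log_one_plus)

# Two made-up cases: one observed time and one censored time.
example = loglogistic_loglik(t=[1.0, 3.0], z=[0, 1], f=[0.0, 0.5], r=1.5)
```

A function like this could serve as the likelihood term in a Laplace or MCMC scheme for the latent values, with $f$ supplied by the linear-plus-Gaussian-process latent model.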

Using the linear response and importance weighting approximations, computation time
for leave-one-out cross-validation predictions was three times that of the posterior inference.
For a model with only constant and linear latent components, the cross-validated
pointwise predictive accuracy is $\mathrm{lppd}_{\text{loo-cv}} = -1662$, or $-1629$ for the model with
additional nonlinear components, which is a clear improvement. Figure 21.6 shows the
estimated conditional comparison of each predictor with all others fixed to their mean
values or defined values. The model has found clear nonlinear patterns, and the right

Figure 21.6 For the leukemia example, estimated conditional comparison for each predictor with
other predictors fixed to their mean values or defined values. The thick line in each graph is the 
posterior median estimated using a Gaussian process model, and the thin lines represent pointwise 
90% intervals. 


bottom subplot also shows that the conditional comparison associated with WBC has 
an interaction with the Townsend deprivation index. The benefit of the Gaussian 
process model was that we did not need to explicitly define any parametric form for 
the functions or define any interaction terms. 

In the previous studies, WBC was not log-transformed, and the decrease in the expected
lifetime when WBC is small was not found. The interaction between WBC and TDI
was also not found, as only additive models were considered. An additional spatial
component explained some of the variation in those studies, but it did not have a
significant effect when added to the model described above.


21.4 Functional data analysis 


Functional data analysis considers responses and predictors for a subject not as scalar or
vector-valued random variables but instead as random functions defined at infinitely many
points. In practice one can collect observations on these functions only at finitely many
points. Let $y_i = (y_{i1}, \ldots, y_{in_i})$ denote the observations on function $f_i$ for subject $i$, where
$y_{ij}$ is an observation at point $t_{ij}$, with $t_{ij} \in T$. For example, $f_i : T \to \mathbb{R}$ may correspond to a
trajectory of blood pressure or body weight as a function of age. Allowing for measurement
errors in observations of a smooth trajectory, we let


$$
y_{ij} \sim \mathrm{N}(f_i(t_{ij}), \sigma^2).
$$


The question is then how to model the collection of functions $\{f_i\}_{i=1}^{n}$ for the different
subjects, accommodating flexibility in modeling the individual functions while also allowing
borrowing of information.
Gaussian processes can be easily used for functional data analysis. For example, in
normal regression we have

$$
y_{ij} \sim \mathrm{N}(f(x_i, t_{ij}), \sigma^2),
$$


where $x_i$ are subject-specific predictors. We set a Gaussian process prior f ~ GP(m,k),
with, for example, squared exponential covariance, by adding time as an additional dimension:

$$
k((x, t), (x', t')) = \tau^2 \exp\left( -\left[ \sum_{j=1}^{p} \frac{(x_j - x'_j)^2}{l_j^2} + \frac{(t - t')^2}{l_{p+1}^2} \right] \right),
$$

where $\tau$ controls the magnitude and $l_1, \ldots, l_{p+1}$ control the smoothness in different predictor
directions. More similar $x$ and $x'$ imply more similar $f$ and $f'$. This kind of functional data
analysis is naturally modeled by Gaussian processes and requires no additional computational
methods.
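
As a minimal sketch of treating time as one more input dimension, the following Python function evaluates a squared exponential covariance of the form $\tau^2 \exp(-[\sum_j (x_j - x'_j)^2/l_j^2 + (t - t')^2/l_{p+1}^2])$. The function name, the single-predictor example, and all numerical values are illustrative assumptions, not taken from the text.

```python
import math


def sq_exp_cov_with_time(x, t, x2, t2, tau, lengthscales):
    """Covariance between f(x, t) and f(x2, t2).

    x, x2        -- sequences holding the p subject-specific predictors
    t, t2        -- observation times
    tau          -- magnitude parameter
    lengthscales -- p + 1 positive values; the last one belongs to time
    """
    if len(x) != len(x2) or len(lengthscales) != len(x) + 1:
        raise ValueError("expected p predictors per point and p + 1 length scales")
    a = list(x) + [t]      # time is simply appended as dimension p + 1
    b = list(x2) + [t2]
    scaled_sq_dist = sum(((ai - bi) / l) ** 2 for ai, bi, l in zip(a, b, lengthscales))
    return tau ** 2 * math.exp(-scaled_sq_dist)


# One hypothetical standardized predictor, observed at ages 30 and 32.
same_predictor = sq_exp_cov_with_time([0.2], 30.0, [0.2], 32.0, tau=1.5, lengthscales=[1.0, 10.0])
other_predictor = sq_exp_cov_with_time([0.2], 30.0, [1.4], 32.0, tau=1.5, lengthscales=[1.0, 10.0])
print(same_predictor, other_predictor)
```

Points that are close in both predictor and time get a covariance near $\tau^2$; moving the predictor value away lowers the covariance, which is how information is borrowed mostly from similar subjects.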


21.5 Density estimation and regression 


So far we have discussed Gaussian processes as prior distributions for a function controlling 
the location and potentially the shape parameter of a parametric observation model. To get
more flexibility we would also like to model the conditional observation model
nonparametrically. One way to do this is the logistic Gaussian process (LGP), and in later chapters we
consider an alternative approach based on Dirichlet processes.


Density estimation 


In introducing the LGP, we start with the independent and identically distributed case,
$y_i \overset{\text{iid}}{\sim} p$, for simplicity. The LGP works by generating a random surface (curve in the one-
dimensional density case) from a Gaussian process and then transforming the surface to the
space of probability densities. To constrain the function to be non-negative and integrate to 1,
we use the continuous logistic transformation,


$$
p(y \mid f) = \frac{e^{f(y)}}{\int e^{f(y')} \, dy'},
$$


where f ~ GP(m, k) is a realization from a continuous Gaussian process. It is appealing
to choose m to be the log density of an elicited parametric distribution such as a Student’s
t-distribution with parameters chosen empirically or integrated out as part of the inference.
The covariance function k defining the smoothness properties of the process can be chosen 
to be, for example, squared exponential: 


$$
k(y, y') = \tau^2 \exp\left( -\frac{|y - y'|^2}{l^2} \right),
$$


where $\tau$ controls the magnitude while $l$ controls the smoothness of the realizations.
An alternative specification uses a zero-mean Gaussian process W(t) on [0, 1] and defines 


$$
p(y) = g_0(y) \, \frac{e^{W(G_0(y))}}{\int_0^1 e^{W(t)} \, dt},
$$


where $g_0$ is some elicited parametric distribution with cumulative distribution function $G_0$.
The compactification by $G_0$ allows modeling with a GP from $[0,1] \to \mathbb{R}$ and makes the prior
smoother on the tails of $g_0$, which may sometimes be a desirable property.

The challenge for the inference is the integral over continuous f in the denominator of the 
likelihood. In practice this integral is computed using a finite basis function representation 
or a discretization of a chosen finite region. Inference for f and parameters of m and k can be 
made using various Markov chain simulation methods, or using a combination of Laplace’s 
method for the latent values f and quadrature integration for parameters. Compared to 
mixture model density estimation (see Chapters 22 and 23), the logistic Gaussian process 

Figure 21.7 Two simple examples of density estimation using Gaussian processes. Left column
shows acidity data and right column shows galaxy data. Top row shows histograms and bottom row
shows logistic Gaussian process density estimate means and 90% pointwise posterior intervals.


has a computational advantage as the posterior of f is unimodal given hyperparameters 
(and fixed finite representation). 
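
The discretization idea can be sketched in a few lines of Python. The example below draws one function from a zero-mean Gaussian process prior on an equally spaced grid over a chosen finite region and applies the logistic transformation, replacing the integral in the denominator by a Riemann sum. The zero mean, the squared exponential form $\tau^2 \exp(-|y - y'|^2/l^2)$, the grid, the jitter term, and all function names are assumptions made for illustration; this is a prior draw only, not the posterior inference described above.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def lgp_prior_draw(grid, tau, length, rng, jitter=1e-6):
    """Draw f ~ GP(0, k) on an equally spaced grid and map it to a density."""
    n = len(grid)
    cov = [[tau ** 2 * math.exp(-((grid[i] - grid[j]) / length) ** 2)
            + (jitter if i == j else 0.0)          # small diagonal term for stability
            for j in range(n)] for i in range(n)]
    low = cholesky(cov)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    f = [sum(low[i][k] * noise[k] for k in range(i + 1)) for i in range(n)]
    # Continuous logistic transformation, with the integral in the
    # denominator replaced by a Riemann sum over the grid cells.
    width = grid[1] - grid[0]
    top = max(f)                                   # subtract the max before exponentiating
    unnorm = [math.exp(v - top) for v in f]
    total = sum(unnorm) * width
    return [u / total for u in unnorm]


rng = random.Random(0)
grid = [-3.0 + 6.0 * i / 59 for i in range(60)]
density = lgp_prior_draw(grid, tau=1.0, length=1.0, rng=rng)
print(sum(density) * (grid[1] - grid[0]))
```

Whatever values the latent function takes, the transformed values are non-negative and sum to 1 when multiplied by the cell width, which is the constraint the logistic transformation is meant to enforce.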


Example. One-dimensional densities: galaxies and lakes 

We illustrate the logistic Gaussian process density estimation with two small univariate 
datasets from the statistics literature. The first example is a set of 82 measurements 
of speeds of galaxies diverging from our own. The second example is a set of measure- 
ments of acidity from 155 lakes in north-central Wisconsin. We fit a Gaussian process 
with a Matérn ($\nu = 3$) covariance function centered on a Gaussian. We use the Laplace
method (that is, mode-centered normal approximation) to integrate over f and the 
marginal posterior mode for the hyperparameters. An additional prior constraint forc- 
ing tails to be decreasing was implemented with rejection sampling. Figure 21.7 shows 
usual histograms and LGP density estimates. Compared to the density estimate using 
a mixture of Gaussians in Figure 22.4, the LGP produces a more flexible form for the
density estimates.


Density regression 


The above LGP prior can be easily generalized to density regression by placing a prior on
the collection of conditional densities, $p_Y = \{p(y \mid x), x \in \mathbb{R}^p, y \in \mathbb{R}\}$, as

$$
p(y \mid x) = \frac{e^{f(x, y)}}{\int e^{f(x, y')} \, dy'},
$$


where $f$ is drawn from a Gaussian process with, for example, squared exponential covariance
kernel

$$
k((x, y), (x', y')) = \tau^2 \exp\left( -\left[ \sum_{j=1}^{p} \frac{(x_j - x'_j)^2}{l_j^2} + \frac{(y - y')^2}{l_{p+1}^2} \right] \right). \tag{21.4}
$$

One can potentially choose hyperpriors for the $l_j$’s that allow certain predictors to effectively
drop out of the model while allowing considerable changes for other predictors.
Letting $s = (s_1, \ldots, s_p) \in [-1, 1]^p$ and $t \in [0, 1]$ and prespecifying monotone continuous
functions $F_j : \mathbb{R} \to [-1, 1]$, for $j = 1, \ldots, p$, an alternative compactifying representation is
obtained as

$$
p(y \mid x) = g_0(y) \, \frac{e^{W(F(x), G_0(y))}}{\int_0^1 e^{W(F(x), t)} \, dt},
$$


where $F(x) = (F_1(x_1), \ldots, F_p(x_p))$ and $W$ is drawn from a Gaussian process. Here the
compactification brings the same advantages as in the unconditional density estimation. 

Inference in density regression is similar to unconditional density estimation, although 
additional challenges arise if there are multiple predictors requiring approximations to the 
GP or computation to keep the finite representation of the multidimensional function com- 
putationally feasible. 


Latent-variable regression 


Recently, a simple alternative to the LGP was proposed that also relies on a Gaussian 
process prior in inducing priors for densities or conditional densities, but in a fundamentally 
different manner. Focusing again on the univariate iid case to start, consider the latent- 
variable regression model (without predictors), 


$$
y_i \sim \mathrm{N}(\mu(u_i), \sigma^2), \quad u_i \sim \mathrm{U}(0, 1), \tag{21.5}
$$


where $u_i$ is a uniform latent variable and $\mu : [0,1] \to \mathbb{R}$ is an unknown regression function.
The prior distribution for $(\mu, \sigma)$ induces a corresponding prior distribution $f \sim \Pi$ on the
density of $y_i$. There is a literature on nonparametric latent variable models similar to
(21.5) motivated by dimensionality reduction in multivariate data analysis. One can sample
$y_i \sim f_0$ for any strictly positive density $f_0$ by drawing a uniform random variable $u_i$ and
then letting $y_i = F_0^{-1}(u_i)$, with $F_0^{-1}$ the inverse cumulative distribution function corresponding to
density $f_0$. This provides an intuition for why (21.5) is flexible given a flexible prior on $\mu$
and a prior on $\sigma^2$ that assigns positive probability to neighborhoods of zero.

From a practical perspective, a convenient and flexible choice corresponds to drawing
$\mu$ from a Gaussian process centered on $\mu_0$ with a squared exponential covariance kernel.
To center the prior on a guess $g_0$ for the unknown density, one can let $\mu_0 = G_0^{-1}$. Such
centering will aid practical performance when the guess $g_0$ provides an adequate
approximation. Computation can be implemented through a straightforward data augmentation
algorithm. Conditionally on the latent variables $u_i$, expression (21.5) is a standard Gaussian
process regression model and computation can proceed via standard algorithms, which
for example update $\mu$ at the unique $u_i$ values from a multivariate Gaussian conditional.
By approximating the continuous U(0,1) with a discrete uniform over a fine grid, one can
(a) enable conjugate updating of the $u_i$’s over the grid; and (b) reduce the computational
burden associated with updating $\mu$ at the realized $u_i$ values.
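
The intuition behind (21.5) can be checked by simulation. The sketch below, written in Python with only the standard library, fixes the regression function at the prior center $\mu_0 = G_0^{-1}$ for a hypothetical normal guess $g_0$, uses a small $\sigma$, and draws from the model; the resulting sample should then look approximately like a sample from $g_0$. The choice of $g_0$, the value of $\sigma$, the clipping of $u$ away from 0 and 1, and the function names are all assumptions for illustration, and no Gaussian process is fitted here.

```python
import random
from statistics import NormalDist, fmean, pstdev


def simulate_latent_variable_model(n, mu, sigma, rng):
    """Draw y_i ~ N(mu(u_i), sigma^2) with u_i ~ U(0, 1), as in (21.5)."""
    draws = []
    for _ in range(n):
        u = rng.random()
        draws.append(rng.gauss(mu(u), sigma))
    return draws


# Hypothetical guess g0 for the unknown density: N(2, 0.5^2).
g0 = NormalDist(mu=2.0, sigma=0.5)


def mu0(u):
    """Prior centering function mu0 = G0^{-1}, clipped away from 0 and 1."""
    return g0.inv_cdf(min(max(u, 1e-12), 1.0 - 1e-12))


rng = random.Random(1)
y = simulate_latent_variable_model(20000, mu0, sigma=0.01, rng=rng)
print(round(fmean(y), 2), round(pstdev(y), 2))
```

Replacing `mu0` by a different monotone or non-monotone function changes the induced density of $y$, which is why a flexible prior on $\mu$ gives a flexible prior on densities.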

The latent variable regression model (21.5) can be easily generalized for the density 
regression problem by simply letting 


$$
y_i \sim \mathrm{N}(\mu(u_i, x_i), \sigma^2), \quad u_i \sim \mathrm{U}(0, 1),
$$


where $x_i = (x_{i1}, \ldots, x_{ip})$ is the vector of observed predictors. This model is essentially
identical to (21.5) and computation can proceed in the same manner, but now $\mu : \mathbb{R}^{p+1} \to \mathbb{R}$ is
a $(p + 1)$-dimensional surface drawn from a Gaussian process instead of a one-dimensional
curve. The covariance function can be chosen to be squared exponential with a differ- 
ent spatial-range (smoothness) parameter for each dimension, similarly to (21.4). This is 
important in allowing sufficient flexibility to adaptively drop out predictors that are not 
needed and allow certain predictors to have a substantially larger impact on the conditional 
response density. 




21.6 Bibliographic note 


Some key references on Bayesian inference for Gaussian processes are O’Hagan (1978), Neal 
(1998), and Rasmussen and Williams (2006). Reviews of many recent advances in computation
for Gaussian processes are included in Vanhatalo et al. (2013a). Rasmussen and Nickisch
(2010) provide software for the book by Rasmussen and Williams (2006), and Vanhatalo et 
al. (2013b) provide software implementing a variety of different models, covariance approx- 
imations, inference methods (such as Laplace, expectation propagation, and MCMC) and 
model assessment tools for Gaussian processes. 

Some recent references on efficient Bayesian computation in Gaussian processes are 
Tokdar (2007), Banerjee, Dunson, and Tokdar (2011) for regression, Hensman, Fusi and 
Lawrence (2013) for big data, Vanhatalo and Vehtari (2010) for binary classification, Riihi- 
maki, Jylanki, and Vehtari (2013) for multiclass classification, Vanhatalo, Pietilainen, and 
Vehtari (2010) for disease mapping by combining long and short scale approximations, Jy- 
lanki, Vanhatalo, and Vehtari (2011) for robust regression, Riihimaki and Vehtari (2010) for 
monotonic regression, Tolvanen, Jylanki and Vehtari (2014) for nonstationary heteroscedas- 
tic regression, and Vehtari et al. (2016) for leave-one-out cross-validation. Savitsky, Van- 
nucci, and Sha (2011) propose an approach for variable selection in high-dimensional Gaus- 
sian process regression. Lindgren, Rue, and Lindstrom (2013) approximate some Gaussian 
processes with Gaussian Markov random fields by using an approximate weak solution of 
the corresponding stochastic partial differential equation. Sarkka, Solin, and Hartikainen 
(2013) show how certain types of space-time Gaussian process regression can be converted 
into finite or infinite-dimensional state space models that allow for efficient (linear time) 
inference via the methods from Bayesian filtering and smoothing reviewed by Sarkka (2013). 

The birthday data come from the National Vital Statistics System natality data and 
are at http://www.mechanicalkern.com/static/birthdates-1968-1988.csv, provided
by Robert Kern using Google BigQuery. 

The leukemia example comes from Henderson, Shimakura, and Gorst (2002). Joensuu 
et al. (2012) apply a Gaussian process Cox proportional hazards model and Joensuu et 
al. (2014) apply a Gaussian process non-proportional hazard model with time dependent 
covariates and interval censored data in GIST cancer recurrence prediction. 

Myllymaki, Sarkka, and Vehtari (2013) discuss Gaussian processes in functional data 
analysis for point patterns. 

The logistic Gaussian process (LGP) was introduced by Leonard (1978). Lenk (1991, 
2003) develops inferences based on posterior moments. Tokdar (2007) presents MCMC im- 
plementation of unconditional LGP, and Tokdar, Zhu, and Ghosh (2010) present MCMC for 
LGP density regression. Riihimaki and Vehtari (2014) derive a fast Laplace approximation 
for LGP density estimation and regression. Tokdar and Ghosh (2007) and Tokdar, Zhu, 
and Ghosh (2010) prove consistency of LGP for density estimation and density regression, 
respectively. Adams et al. (2009) propose an alternative GP approach in which the numer- 
ical approximation of the normalizing term in the likelihood is avoided by a conditioning 
set and an elaborate rejection sampling method. The latent-variable regression model in 
the last part of Section 21.5 comes from Kundu and Dunson (2014). 


21.7 Exercises 


1. Replicate the sampling from the Gaussian process prior from Figure 21.1. Use univariate 
x in a grid. Generation of random samples from the multivariate normal is described in 
Appendix A. Invent some data, compute the posterior mean and covariance as in (21.1), 
and sample functions from the posterior distribution. 


2. Gaussian processes: The file naes04.csv contains age, sex, race, and attitude on three
gay-related questions from the 2004 National Annenberg Election Survey. The three 
questions are whether the respondent favors a constitutional amendment banning same-
sex marriage, whether the respondent supports a state law allowing same-sex marriage, 
and whether the respondent knows any gay people. Figure 20.5 on page 499 shows the 
data for the latter two questions (averaged over all sex and race categories). 

For this exercise, you will only need to consider the outcome as a function of age, and 
for simplicity you should use the normal approximation to the binomial distribution for 
the proportion of Yes responses for each age. 

(a) Set up a Gaussian process model to estimate the percentage of people in the population 
who believe they know someone gay (in 2004), as a function of age. Write the model 
in statistical notation (all the model, including prior distribution), and write the 
(unnormalized) joint posterior density. As noted above, use a normal model for the 
data. 

(b) Program the log of the unnormalized marginal posterior density of hyperparameters 
Eq. (21.1) as an R function. 

(c) Fit the model. You can use MCMC, normal approximation, variational Bayes, expec- 
tation propagation, Stan, or any other method. But your fit must be Bayesian. 

(d) Graph your estimate along with the data (plotting multiple graphs on a single page). 

3. Gaussian processes with binary data: Repeat the previous exercise but this time using 
the binomial model for the Yes/No responses. The computation will be more complicated 
but your results should be similar. Discuss any differences compared to the results from 
the previous exercise. 

4. Gaussian processes with multiple predictors: Repeat the previous exercise but this time 
estimating the percentage of people in the population who believe they know someone 
gay (in 2004), as a function of three predictors: age, sex, and race. 

5. Gaussian processes with binary data: Table 19.1 on page 486 presents data on the success 
rate of putts by professional golfers. 

(a) Fit a Gaussian process model for the probability of success (using the binomial like- 
lihood) as a function of distance. Compare to your solutions of Exercises 19.2 and 
20.4. 

(b) Use posterior predictive checks to assess the fit of the model. 

6. Model building with Gaussian processes: 

(a) Replicate the birthday analyses from Section 21.2. Build up the model by adding 
covariance functions one by one. 

(b) The day-of-week and seasonal effects appear to be increasing over time. Expand the 
model to allow the day-of-year effects to increase over time in a similar way. Fit the 
expanded model to the data and graph and discuss the results. 

7. Hierarchical model and Gaussian process: Repeat Exercise 20.5 using a Gaussian process 
instead of a spline model for the underlying time series. 

8. Let $\mu(x) = \sum_{h=1}^{k} \beta_h b_h(x)$ with $b_h(x) = \exp(-\psi (x - \tau_h)^2)$ for $h = 1, \ldots, k$.

(a) If possible, choose a prior on $(\beta_1, \ldots, \beta_k)$ so that $\mu \sim \mathrm{GP}(m, k)$, a Gaussian process
with mean function $m$ and covariance function $k$.

(b) Describe the exact analytic forms (if possible) for $m$ and $k$.

(c) How does the covariance function differ from letting $k(x, x') = \exp(-\kappa (x - x')^2)$?

(d) Describe an algorithm to choose $\psi$ and $\tau_1, \ldots, \tau_k$ to minimize this difference.

9. Continuing the previous problem, suppose we instead let $\mu(x) = \beta_1 + \beta_2 x$ with Gaussian
priors placed on $\beta_1$ and $\beta_2$.

(a) Does this induce a Gaussian process prior on the function $\mu(x)$?

(b) Since we have a Gaussian process prior, does that mean that we can capture non-linear
functions with this prior?


Chapter 22


Finite mixture models 


Mixture distributions arise in practical problems when the measurements of a random vari- 
able are taken under two different conditions. For example, the distribution of heights in 
a population of adults reflects the mixture of males and females in the population, and 
the reaction times of schizophrenics on an attentional task might be a mixture of trials 
in which they are or are not affected by an attentional delay (an example discussed later 
in this chapter). For the greatest flexibility, and consistent with our general hierarchical 
modeling strategy, we construct such distributions as mixtures of simpler forms. For ex- 
ample, it is best to model male and female heights as separate univariate, perhaps normal, 
distributions, rather than a single bimodal distribution. This follows our general principle 
of using conditioning to construct realistic probability models. The schizophrenic reaction 
times cannot be handled in the same way because it is not possible to identify which trials 
are affected by the attentional delay. Mixture models can be used in problems of this type, 
where the population of sampling units consists of a number of subpopulations within each 
of which a relatively simple model applies. In this chapter we discuss methods for analyzing 
data using mixture models. 

The basic principle for setting up and computing with mixture models is to introduce 
unobserved indicators—random variables, which we usually label as a vector or matrix 
z, that specify the mixture component from which each particular observation is drawn. 
Thus the mixture model is viewed hierarchically; the observed variables y are modeled 
conditionally on the vector z, and the vector z is itself given a probabilistic specification. 
Often it is useful to think of the mixture indicators as missing data. Inferences about 
quantities of interest, such as parameters within the probability model for y, are obtained 
by averaging over the distribution of the indicator variables. In the simulation framework, 
this means drawing $(\theta, z)$ from their joint posterior distribution.


22.1 Setting up and interpreting mixture models


Finite mixtures


Suppose that, based on substantive considerations, it is considered desirable to model the
distribution of $y = (y_1, \ldots, y_n)$, or the distribution of $y|x$, as a mixture of $H$ components.
It is assumed that it is not known which mixture component underlies each particular
observation. Any information that makes it possible to specify a nonmixture model for
some or all of the observations, such as sex in our discussion of the distribution of adult
heights, should be used to simplify the model. For $h = 1, \ldots, H$, the $h$-th component
distribution, $f_h(y_i|\theta_h)$, is assumed to depend on a parameter vector $\theta_h$; the parameter
denoting the proportion of the population from component $h$ is $\lambda_h$, with $\sum_{h=1}^{H} \lambda_h = 1$. It is
common to assume that the mixture components are all from the same parametric family,
such as the normal, with different parameter vectors. The sampling distribution of $y$ in
that case is

$$
p(y_i|\theta, \lambda) = \lambda_1 f(y_i|\theta_1) + \lambda_2 f(y_i|\theta_2) + \cdots + \lambda_H f(y_i|\theta_H). \tag{22.1}
$$


The form of the sampling distribution invites comparison with the standard Bayesian
setup. The mixture distribution with probabilities $\lambda = (\lambda_1, \ldots, \lambda_H)$ might be thought of
as a discrete prior distribution on the parameters $\theta_h$; however, it seems more appropriate
to think of this prior, or mixing, distribution as a description of the variation in $\theta$ across
the population of interest. In this respect the mixture model more closely resembles a
hierarchical model. This resemblance is enhanced with the introduction of unobserved (or
missing) indicator variables $z_{ih}$, with

$$
z_{ih} = \begin{cases} 1 & \text{if the $i$th unit is drawn from the $h$th mixture component} \\ 0 & \text{otherwise.} \end{cases}
$$


Given $\lambda$, the distribution of each vector $z_i = (z_{i1}, \ldots, z_{iH})$ is $\mathrm{Multin}(1; \lambda_1, \ldots, \lambda_H)$. In
this case the mixture parameters $\lambda$ are thought of as hyperparameters determining the
distribution of $z$. The joint distribution of the observed data $y$ and the unobserved indicators
$z$ conditional on the model parameters can be written

$$
p(y, z|\theta, \lambda) = p(z|\lambda)\, p(y|z, \theta) = \prod_{i=1}^{n} \prod_{h=1}^{H} \left( \lambda_h f(y_i|\theta_h) \right)^{z_{ih}}, \tag{22.2}
$$


with exactly one of $z_{ih}$ equaling 1 for each $i$. At this point, $H$, the number of mixture
components, is assumed to be known and fixed. We consider this issue further when
discussing model checking. If observations $y$ are available for which their mixture components
are known (for example, the heights of a group of adults whose sexes are recorded), the
mixture model (22.2) is easily modified; each such observation adds a single factor to the
product with a known value of the indicator vector $z_i$.
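
The role of the indicators can be made concrete with a small simulation. The Python sketch below draws each $z_i$ from a multinomial with one trial, draws $y_i$ from the selected component, and evaluates the logarithm of the complete-data density (22.2). Normal components, the two-component heights-style parameter values, and the function names are assumptions chosen for illustration only.

```python
import math
import random


def normal_logpdf(y, mean, sd):
    return -0.5 * math.log(2.0 * math.pi * sd ** 2) - 0.5 * ((y - mean) / sd) ** 2


def draw_indicators_and_data(n, lam, means, sds, rng):
    """Draw z_i ~ Multin(1; lam), then y_i from the selected normal component."""
    H = len(lam)
    y, z = [], []
    for _ in range(n):
        h = rng.choices(range(H), weights=lam)[0]
        z.append([1 if k == h else 0 for k in range(H)])   # one-hot row z_i
        y.append(rng.gauss(means[h], sds[h]))
    return y, z


def complete_data_loglik(y, z, lam, means, sds):
    """log p(y, z | theta, lambda) = sum_i sum_h z_ih * log(lam_h * f(y_i | theta_h))."""
    total = 0.0
    for yi, zi in zip(y, z):
        for h, zih in enumerate(zi):
            if zih:                                        # only the selected component contributes
                total += math.log(lam[h]) + normal_logpdf(yi, means[h], sds[h])
    return total


rng = random.Random(42)
lam, means, sds = [0.5, 0.5], [162.0, 176.0], [7.0, 7.0]   # illustrative values only
y, z = draw_indicators_and_data(200, lam, means, sds, rng)
print(complete_data_loglik(y, z, lam, means, sds))
```

An observation whose component is known enters the same sum with its row of `z` fixed rather than drawn, which is the modification described above.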


Continuous mixtures 


The finite mixture is a special case of the more general specification, $p(y_i) = \int p(y_i|\theta)\lambda(\theta)\,d\theta$.
The hierarchical models of Chapter 5 can be thought of as continuous mixtures in the sense
that each observable $y_i$ is a random variable with distribution depending on parameters $\theta_i$;
the prior distribution or population distribution of the parameters $\theta_i$ is given by the mixing
distribution $\lambda(\theta)$. Continuous mixtures were used in the discussion of robust alternatives
to standard models; for example, the $t$ distribution, a mixture on the scale parameter of
normal distributions, yields robust alternatives to normal models, as discussed in Chapter
17. The negative binomial and beta-binomial distributions are discrete distributions that
are obtained as continuous mixtures of Poisson and binomial distributions, respectively.

The computational approach for continuous mixtures follows that for finite mixtures
closely (see also the discussion of computational methods for hierarchical models in
Chapter 5). We briefly discuss the setup of a probability model based on the $t$ distribution with
$\nu$ degrees of freedom in order to illustrate how the notation of this chapter is applied to
continuous mixtures. The observable $y_i$ given the location parameter $\mu$, variance
parameter $\sigma^2$, and scale parameter $z_i$ (similar to the indicator variables in the finite mixture) is
$\mathrm{N}(\mu, \sigma^2 z_i)$. The location and variance parameters can be thought of as the mixture
component parameters $\theta$. The $z_i$ are viewed as a random sample from the mixture distribution,
in this case a scaled Inv-$\chi^2(\nu, 1)$ distribution. The marginal distribution of $y_i$, after
averaging over $z_i$, is $t_\nu(\mu, \sigma^2)$. The degrees of freedom parameter, $\nu$, which may be fixed or
unknown, describes the mixing distribution for this continuous mixture in the same way
that the multinomial parameters $\lambda$ do for finite mixtures. The posterior distribution of the
$z_i$’s may also be of interest in this case for assessing which observations are possible outliers.
In the remainder of this chapter we focus on finite mixtures; the modifications required for
continuous mixtures are typically minor.
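
The scale-mixture construction of the $t$ distribution can be verified by simulation. The following Python sketch draws $z_i$ from a scaled Inv-$\chi^2(\nu, 1)$ distribution (as $\nu$ divided by a $\chi^2_\nu$ draw), then draws $y_i$ from $\mathrm{N}(\mu, \sigma^2 z_i)$, and compares the sample variance with the known variance $\sigma^2 \nu/(\nu - 2)$ of a $t_\nu(\mu, \sigma^2)$ distribution. The parameter values, sample size, and function name are illustrative assumptions.

```python
import random
from statistics import fmean, pvariance


def draw_t_by_scale_mixture(n, nu, mu, sigma, rng):
    """Return (y, z) with z_i ~ Inv-chi^2(nu, 1) and y_i | z_i ~ N(mu, sigma^2 z_i)."""
    y, z = [], []
    for _ in range(n):
        chi2 = rng.gammavariate(nu / 2.0, 2.0)   # chi^2 with nu degrees of freedom
        zi = nu / chi2                           # scaled inverse-chi^2 with scale 1
        z.append(zi)
        y.append(rng.gauss(mu, sigma * zi ** 0.5))
    return y, z


rng = random.Random(7)
nu, mu, sigma = 10.0, 0.0, 1.0
y, z = draw_t_by_scale_mixture(50000, nu, mu, sigma, rng)
# For nu > 2 the variance of t_nu(mu, sigma^2) is sigma^2 * nu / (nu - 2), here 1.25.
print(fmean(y), pvariance(y))
```

In a fitted model the large values among the $z_i$ would be inferred from the data rather than simulated, and they mark the observations that the model treats as having inflated variance.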


Identifiability of the mixture likelihood


Parameters in a model are not identified if the same likelihood function is obtained for more 
than one choice of the model parameters. All finite mixture models are nonidentifiable in 
one sense; the distribution is unchanged if the group labels are permuted. For example, 
there is ambiguity in a two-component mixture model concerning which component should 
be designated as component 1 (see the discussion of aliasing in Section 4.3). When possible, 
the parameter space should be defined to clear up any ambiguity, for example by specifying 
the means of the mixture components to be in nondecreasing order or specifying the mixture 
proportions $\lambda_h$ to be nondecreasing. For many problems, an informative prior distribution
has the effect of identifying specific components with specific subpopulations. 


Prior distribution 


The prior distribution for the finite mixture model parameters $(\theta, \lambda)$ is taken in most
applications to be a product of independent prior distributions on $\theta$ and $\lambda$. If the vector of
mixture indicators $z_i = (z_{i1}, \ldots, z_{iH})$ is modeled as multinomial with parameter $\lambda$, then the
natural conjugate prior distribution is the Dirichlet, $\lambda \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_H)$. The relative
sizes of the Dirichlet parameters $\alpha_h$ describe the mean of the prior distribution for $\lambda$, and
the sum of the $\alpha_h$’s is a measure of the strength of the prior distribution, the ‘prior sample
size.’ We use $\theta$ to represent the vector consisting of all of the parameters in the mixture
components, $\theta = (\theta_1, \ldots, \theta_H)$. Some parameters may be common to all components and
other parameters specific to a single component. For example, in a mixture of normals, we
might assume that the variance is the same for each component but that the means differ.
For now we do not make any assumptions about the prior distribution, $p(\theta)$. In continuous
mixtures, the parameters of the mixture distribution (for example, the degrees of freedom
in the t model) require a hyperprior distribution.
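
A short Python sketch can make the interpretation of the Dirichlet parameters concrete: the prior mean is the vector of relative sizes, the sum acts as a prior sample size, and, by multinomial-Dirichlet conjugacy, conditioning on indicator counts $n_h = \sum_i z_{ih}$ simply adds the counts to the parameters. The indicators are not observed in practice, so the update shown is the conditional step given $z$, not a full inference procedure. The function names and numbers are assumptions for illustration.

```python
import random


def dirichlet_mean_and_strength(alpha):
    """Prior mean of lambda and the 'prior sample size' sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha], total


def dirichlet_update(alpha, counts):
    """Conjugate update of the Dirichlet parameters given counts n_h = sum_i z_ih."""
    return [a + c for a, c in zip(alpha, counts)]


def draw_dirichlet(alpha, rng):
    """Draw lambda ~ Dirichlet(alpha) by normalizing independent gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]


alpha = [2.0, 1.0, 1.0]                       # hypothetical prior for H = 3 components
prior_mean, prior_size = dirichlet_mean_and_strength(alpha)
posterior_alpha = dirichlet_update(alpha, counts=[10, 0, 5])
lam_draw = draw_dirichlet(posterior_alpha, random.Random(3))
print(prior_mean, prior_size, posterior_alpha, lam_draw)
```

With a small prior sample size, as here, a component with no assigned observations keeps only its prior weight, which is why the choice of the $\alpha_h$ matters when some components may be empty.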


Ensuring a proper posterior distribution 


As has been emphasized throughout, it is critical to check before applying an improper 
prior distribution that the resulting model is well specified. An improper noninformative 
prior distribution for $\lambda$ (corresponding to $\alpha_h = 0$) may cause a problem if the data do not
indicate that all H components are present in the data. It is more common for problems 
to arise if improper prior distributions are used for the component parameters. In Section 
4.3, we mention the difficulty in assuming an improper prior distribution for the separate 
variances of a mixture of two normal distributions. There are a number of ‘uninteresting’ 
modes that correspond to a mixture component consisting of a single observation with no 
variance. The posterior distribution of the parameters of a mixture of two normals is proper 
if the ratio of the two unknown variances is fixed or assigned a proper prior distribution, 
but not if the parameters $(\log \sigma_1, \log \sigma_2)$ are assigned a joint uniform prior density.


Number of mixture components 


For finite mixture models there is often uncertainty concerning the number of mixture 
components H to include in the model. Computing models with large values of H can 
be sufficiently expensive that it is desirable to begin with a small mixture and assess the 
adequacy of the fit. It is often appropriate to begin with a small model for scientific reasons 
as well, and then determine whether some features of the data are not reflected in the 
current model. The posterior predictive distribution of a suitably chosen test quantity can 
be used to determine whether the current number of components describes the range of 
observed data. The test quantity must be chosen to measure aspects of the data that are 
not sufficient statistics for model parameters. An alternative approach is to view H as a
hierarchical parameter that can attain the values 1,2,3,... and average inferences about y 
over the posterior distribution of mixture models. 


More general formulation 


Finite mixtures arise through supposing that each of the $i = 1, \ldots, n$ items in the sample
belongs to one of $H$ subpopulations, with each latent subpopulation or latent class having
a different value of one or more parameters in a parametric model. Let $z_i \in \{1, \ldots, H\}$
denote the subpopulation index for item $i$, with this index commonly referred to as the
latent class status. Then, the response $y_i$ for item $i$ conditionally on $z_i$ has the distribution,

$$
(y_i \mid z_i = h) \sim f(\theta_h, \phi), \tag{22.3}
$$

with $f(\theta, \phi)$ a parametric sampling distribution with parameters $\phi$ that do not vary across
the subpopulations and parameters $\theta_h$ corresponding to subpopulation $h$, for $h = 1, \ldots, H$.


Supposing that the proportion of the population belonging to subpopulation $h$ is
$\Pr(z_i = h) = \pi_h$, the following likelihood is obtained in marginalizing out the latent class status:

$$
g(y|\pi, \theta, \phi) = \sum_{h=1}^{H} \pi_h f(y|\theta_h, \phi),
$$


which corresponds to a finite mixture with $H$ components, with component $h$ assigned
probability weight $\pi_h$. In general, $g$ can approximate a much broader variety of true likelihoods
than can $f$. To illustrate this, consider the simple example in which $f(y|\theta, \phi) = \mathrm{N}(y|\theta, \phi^2)$,
so that in subpopulation $h$ the data follow the normal likelihood $(y_i|z_i = h) \sim \mathrm{N}(\theta_h, \phi^2)$.
The normal assumption is restrictive in implying that the distribution is unimodal and
symmetric, with a particular fixed shape. However, by allowing the mean of the normal
distribution to vary across subpopulations, we obtain a flexible location mixture of normals
model in marginalizing out the latent subpopulation status:

$$
g(y|\pi, \theta, \phi) = \sum_{h=1}^{H} \pi_h \mathrm{N}(y|\theta_h, \phi^2). \tag{22.4}
$$


Any density can be accurately approximated using a mixture of sufficiently many normals. 
By allowing the variance in the normal kernel to also vary across components by placing an 
$h$ subscript on $\phi$, one instead obtains a location-scale mixture of normals. Location-scale
mixtures can often obtain better accuracy using fewer mixture components than location 
mixtures. In addition, they may be preferred in allowing the sampling variability to differ 
across subpopulations, potentially leading to more realistic characterization. 
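
The two mixture densities just described can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function names and the numerical values are illustrative assumptions and do not come from the text.

```python
import math

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) evaluated at y."""
    u = (y - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))

def location_mixture_pdf(y, weights, means, sd):
    """Location mixture as in (22.4): sum_h pi_h N(y | theta_h, phi^2)."""
    return sum(w * normal_pdf(y, m, sd) for w, m in zip(weights, means))

def location_scale_mixture_pdf(y, weights, means, sds):
    """Location-scale mixture: component h has its own standard deviation."""
    return sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds))

# Made-up illustration: two equally weighted, well-separated components
# give a bimodal density, which a single normal cannot represent.
weights, means = [0.5, 0.5], [-2.0, 2.0]
dens_at_mode = location_mixture_pdf(2.0, weights, means, 1.0)
dens_at_zero = location_mixture_pdf(0.0, weights, means, 1.0)
```

With these made-up values the density is lower at zero than near either component mean, which is the kind of shape the single normal kernel rules out.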


Mixtures as true models or approximating distributions 


There are two schools of thought in considering finite mixture models. The first viewpoint 
is that the incorporation of latent subpopulations is a realistic characterization of the true 
data-generating mechanism, and that such subpopulations really exist. In such a case there 
may even be interest in performing inferences on these subpopulations and in attempting to 
cluster individuals based on their subpopulation membership. This viewpoint has motivated 
a vast literature on latent class modeling and model-based clustering. Our own take is 
that, while it may sometimes be useful to cluster data and attempt to identify possible 
subpopulations as an exploratory data analysis and hypothesis generating tool, such tasks 
are intrinsically sensitive to the parametric model $f$ used to characterize variability
within a subpopulation. Hence, to the extent that the true parametric model is never known,
one can never fully trust the estimated cluster-specific parameters or clusters obtained. For
example, if one simply changes $f$ from a normal distribution to a $t$, very different
clusters can result. Another example is the multivariate case, where a diagonal covariance
for the components can lead to an inflation in the number of clusters, with true elliptical
subpopulations split into several spherical subgroups to fit the data.

The second school of thought is that trying to infer latent subpopulations is an intrin- 
sically ill-defined statistical problem, but finite mixture models are nonetheless useful in 
providing a highly flexible class of probability models that can be used to build more re- 
alistic hierarchical models that better account for one’s true uncertainty about parametric 
choices for random effects, error distributions and other aspects of the model. Finite mixture 
models can be used broadly for univariate or multivariate density estimation, classification, 
and nonparametric regression. 


Basics of computation for mixture models 


We can fit mixture models using the same ideas as with hierarchical models, obtaining 
inferences averaging over the mixture indicators which are thought of as nuisance parameters 
or alternatively as ‘missing data.’ An application to an experiment in psychology is analyzed 
in detail in Section 22.2. 


Crude estimates 


Initial estimates of the mixture component parameters and the relative proportion in each 
component can be obtained using various simple techniques. Graphical or clustering meth- 
ods can be used to identify tentatively the observations drawn from the various mixture 
components. Ordinarily, once the observations have been classified, it is straightforward 
to obtain estimates of the parameters of the different mixture components. This type of 
analysis completely ignores the uncertainty in the indicators and thus can overestimate 
the differences between mixture components. Crude estimates of the indicators—that is, 
estimates of the mixture component to which each observation belongs—can also be ob- 
tained by clustering techniques. However, crude estimates of the indicators are not usually 
required because they can be averaged over in EM or drawn as the first step in a Gibbs 
sampler. 


Posterior modes and marginal approximations using EM and variational Bayes 


With mixture models, it is not useful to find the joint mode of parameters and indicators. 
Instead, the EM algorithm can easily be used to estimate the parameters of a finite mixture 
model, averaging over the indicator variables. This approach also works for continuous 
mixtures, averaging over the continuous mixture variables. In either case, the E-step re- 
quires the computation of the expected value of the sufficient statistics of the joint model of 
$(y, z)$, using the log of the complete-data likelihood (22.2), conditional on the last guess of
the value of the mixture component parameters $\theta$ and the mixture proportions $\lambda$. In finite
mixtures this is often equivalent to computing the conditional expectation of the indicator 
variables by Bayes’ rule. For some problems, including the schizophrenic reaction times 
example discussed later in this chapter, the ECM algorithm or some other EM alternative 
is useful. It is important to find all of the modes of the posterior distribution and assess 
the relative posterior masses near each mode. We suggest choosing a fairly large number
(perhaps 50 or 100) of starting points by simplifying the model or random sampling. To obtain
multiple starting points, a single crude estimate as in the previous paragraph is not enough.
Instead, various starting points might be obtained by adding randomness to the crude estimate
or from a simplification of the mixture model, for example eliminating random effects
and other hierarchical parameters (as in the example in Section 22.2 below).

Similar computations yield variational Bayes approximations to the marginal distribu- 
tions of the parameters in the model, again averaging over the latent mixture indicators. 
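
To make the E-step and M-step concrete, here is a minimal sketch of EM for a location mixture of normals with a common standard deviation. It assumes a flat prior, so the posterior mode coincides with the maximum likelihood estimate; the function name, the synthetic data, and the starting values are all hypothetical and are not the example analyzed in this chapter.

```python
import math
import random

def normal_pdf(y, mean, sd):
    u = (y - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))

def em_location_mixture(y, weights, means, sd, n_iter=100):
    """EM for sum_h pi_h N(y | theta_h, phi^2) under an assumed flat prior."""
    n, H = len(y), len(weights)
    for _ in range(n_iter):
        # E-step: Pr(z_i = h | y_i, current parameters) by Bayes' rule
        resp = []
        for yi in y:
            joint = [weights[h] * normal_pdf(yi, means[h], sd) for h in range(H)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: maximize the expected complete-data log likelihood
        n_h = [sum(r[h] for r in resp) for h in range(H)]
        weights = [n_h[h] / n for h in range(H)]
        means = [sum(resp[i][h] * y[i] for i in range(n)) / n_h[h] for h in range(H)]
        ss = sum(resp[i][h] * (y[i] - means[h]) ** 2
                 for i in range(n) for h in range(H))
        sd = math.sqrt(ss / n)
    return weights, means, sd

# Synthetic data (made up): about 70% from N(0, 1) and 30% from N(5, 1)
rng = random.Random(0)
y = [rng.gauss(5.0 if rng.random() < 0.3 else 0.0, 1.0) for _ in range(300)]
weights_hat, means_hat, sd_hat = em_location_mixture(y, [0.5, 0.5], [-1.0, 6.0], 2.0)
```

A single run like this finds only one mode; in line with the advice above, one would repeat it from many dispersed starting points and compare the modes found.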


Posterior simulation using the Gibbs sampler 


Starting values for the Gibbs sampler can be obtained via importance resampling from a
suitable approximation to the posterior (a mixture of $t_4$ distributions located at the modes).
For mixture models, the Gibbs sampler alternates two major steps: obtaining draws from
the distribution of the indicators given the model parameters and obtaining draws from
the model parameters given the indicators. The second step may itself incorporate several
steps to update all the model parameters. Given the indicators, the mixture model reduces
to an ordinary (possibly hierarchical) model, such as we have already studied. Thus the
use of conjugate families as prior distributions can be helpful. Obtaining draws from the 
distribution of the indicators is usually straightforward: these are multinomial draws in 
finite mixture models. Modeling errors hinted at earlier, such as incorrect application of an 
improper prior density, are often found during the iterative simulation stage of computa- 
tions. For example, a Gibbs sequence started near zero variance may never leave the area. 
Identifiability problems may also become apparent if the Gibbs sequences appear not to 
converge because of aliasing in permutations of the components. 


Posterior inference 


When the Gibbs sampler has reached approximate convergence, posterior inferences about 
model parameters are obtained by ignoring the drawn indicators. The posterior distribution 
of the indicator variables contains information about the likely components from which each 
observation is drawn. The fit of the model can be assessed by a variety of posterior predictive 
checks, as we illustrate in Section 22.2. If robustness is a concern, the sensitivity of inferences 
to the assumed parametric family can be evaluated using the methods of Chapter 17. 


22.2 Example: reaction times and schizophrenia 


We illustrate simple mixture models (where ‘simple’ refers to there being a fixed number 
of mixture components with specified distributions) with an application to an experiment 
in psychology. Each of 17 people—11 non-schizophrenics and 6 schizophrenics—had their 
reaction times measured 30 times. We present the data in Figure 22.1 and briefly review 
the basic statistical approach here. 

It is clear from the graphs that response times are higher on average for schizophrenics. 
In addition, the response times for at least some of the schizophrenic individuals are con- 
siderably more variable than the response times for the others. Psychological theory from 
the last half century and before suggests a model in which schizophrenics suffer from an 
attentional deficit on some trials, as well as a general motor reflex retardation; both aspects 
lead to relatively slower responses for the schizophrenics, with motor retardation affecting 
all trials and attentional deficiency only some. 


Initial statistical model 


To address the questions of scientific interest, we fit the following basic model, basic in
the sense of minimally addressing the scientific knowledge underlying the data. Response
times for non-schizophrenics are described by a normal random-effects model, in which the
responses of person $j = 1,\ldots,11$ are normally distributed with distinct person mean $\alpha_j$
and common variance $\sigma_y^2$.

Figure 22.1 Logarithms of response times (in milliseconds) for 11 non-schizophrenic individuals
(above) and 6 schizophrenic individuals (below). All histograms are on a common scale, and there
are 30 measurements for each person.


Finite mixture likelihood model. To reflect the attentional deficiency, the response times
for each schizophrenic individual $j = 12,\ldots,17$ are modeled as a two-component mixture:
with probability $(1 - \lambda)$ there is no delay, and the response is normally distributed with
mean $\alpha_j$ and variance $\sigma_y^2$, and with probability $\lambda$ responses are delayed, with observations
having a mean of $\alpha_j + \tau$ and the same variance, $\sigma_y^2$. Because reaction times are all positive
and their distributions are positively skewed, even for non-schizophrenics, the above model
was fitted to the logarithms of the reaction time measurements.


Hierarchical population model. The comparison of the components of $\alpha = (\alpha_1,\ldots,\alpha_{17})$ for
schizophrenics versus non-schizophrenics addresses the magnitude of schizophrenics’ motor
reflex retardation. We include a hierarchical parameter $\beta$ measuring this motor retardation.
Specifically, variation among individuals is modeled by having the means $\alpha_j$ follow a normal
distribution with mean $\mu$ for non-schizophrenics and $\mu + \beta$ for schizophrenics, with each
distribution having a variance of $\sigma_\alpha^2$. That is, the mean of $\alpha_j$ in the population distribution
is $\mu + \beta S_j$, where $S_j$ is an observed indicator variable that is 1 if person $j$ is schizophrenic
and 0 otherwise.

The three parameters of primary interest are: $\beta$, which measures motor reflex retardation;
$\lambda$, the proportion of schizophrenic responses that are delayed; and $\tau$, the size of the
delay when an attentional lapse occurs.


Mixture model expressed in terms of indicator variables. Letting $y_{ij}$ be the $i$th response of
individual $j$, the model can be written hierarchically:

$$\begin{aligned}
y_{ij} \mid \alpha_j, z_{ij}, \phi &\sim \mathrm{N}(\alpha_j + \tau z_{ij}, \sigma_y^2) \\
\alpha_j \mid z, \phi &\sim \mathrm{N}(\mu + \beta S_j, \sigma_\alpha^2) \\
z_{ij} \mid \phi &\sim \mathrm{Bernoulli}(\lambda S_j),
\end{aligned}$$

where $\phi = (\sigma_\alpha^2, \beta, \lambda, \tau, \mu, \sigma_y^2)$, and $z_{ij}$ is an unobserved indicator variable that is 1 if measurement
$i$ on person $j$ arose from the delayed component and 0 if it arose from the undelayed
component. In the following description, we occasionally use $\theta = (\alpha, \phi)$ to represent all of
the parameters except the indicators.

The indicators $z_{ij}$ are not necessary to formulate the model but simplify the conditional
distributions in the model, allowing us to use ECM and Gibbs for easy computation. Because
there are only two mixture components, we only require a single indicator, $z_{ij}$, for
each observation, $y_{ij}$. In our general notation, $H = 2$, the mixture probabilities are $\lambda_1 = \lambda$
and $\lambda_2 = 1 - \lambda$, and the corresponding mixture indicators are $z_{ij1} = z_{ij}$ and $z_{ij2} = 1 - z_{ij}$.


Hyperprior distribution. We start by assigning a noninformative uniform joint prior density
on $\phi$. In this case the model is not identified, because the trials unaffected by a positive
attentional delay could instead be thought of as being affected by a negative attentional
delay. We restrict $\tau$ to be positive to identify the model. The variance components $\sigma_\alpha^2$
and $\sigma_y^2$ are restricted to be positive as well. The mixture parameter $\lambda$ is actually taken
to be uniform on [0.001, 0.999] as values of zero or one would not correspond to mixture
distributions. Science and previous analysis of the data suggest that a simple model without
the mixture is inadequate for this dataset.


Crude estimate of the parameters 


The first step in the computation is to obtain crude estimates of the model parameters. For
this example, each $\alpha_j$ can be roughly estimated by the sample mean of the observations on
person $j$, and $\sigma_y^2$ can be estimated by the average sample variance within non-schizophrenics.
Given the estimates of $\alpha_j$, we can obtain a quick estimate of the hyperparameters by dividing
the $\alpha_j$’s into two groups, non-schizophrenics and schizophrenics. We estimate $\mu$ by the
average of the estimated $\alpha_j$’s for non-schizophrenics, $\beta$ by the average difference between
the two groups, and $\sigma_\alpha^2$ by the variance of the estimated $\alpha_j$’s within groups. We crudely
estimate $\lambda = 1/3$ and $\tau = 1.0$ based on a visual inspection of the lower 6 histograms in Figure 22.1,
which display the schizophrenics’ response times. It is not necessary to create a preliminary
estimate of the indicator variables, $z_{ij}$, because we update $z_{ij}$ as the first step in the ECM
and Gibbs sampler computations.


Finding the modes of the posterior distribution using ECM 


We devote the next few pages to implementation of ECM and Gibbs sampler computations 
for the finite mixture model. The procedure looks difficult but in practice is straightforward, 
with the advantage of being easily extended to more complicated settings. 

We draw 100 points at random from a simplified distribution for $\phi$ and use each as a
starting point for the ECM maximization algorithm to search for modes. The simplified
distribution is obtained by adding some randomness to the crude parameter estimates.
Specifically, to obtain a sample from the simplified distribution, we start by setting all
the parameters at the crude point estimates above and then divide each parameter by an
independent $\chi^2_1$ random variable in an attempt to ensure that the 100 draws are sufficiently
spread out so as to cover the modes of the parameter space.
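
The perturbation described above is short to write down. The sketch below assumes the crude estimates are held in a dictionary; the key names and most of the numerical values are placeholders rather than the estimates actually obtained for these data. The text does not say how draws that fall outside a parameter's allowed range are handled, so no such handling is shown.

```python
import random

def overdispersed_starts(crude, n_starts=100, seed=0):
    """Divide each crude estimate by an independent chi^2 draw with 1 degree of freedom."""
    rng = random.Random(seed)
    starts = []
    for _ in range(n_starts):
        # A chi^2 variate with 1 degree of freedom is a squared standard normal draw.
        starts.append({name: value / rng.gauss(0.0, 1.0) ** 2
                       for name, value in crude.items()})
    return starts

# Placeholder crude estimates (hypothetical values except for lam and tau,
# which follow the visual estimates quoted in the text).
crude = {"mu": 5.7, "beta": 0.3, "tau": 1.0, "lam": 1.0 / 3.0,
         "sigma_y": 0.2, "sigma_alpha": 0.15}
starts = overdispersed_starts(crude)
```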

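The ECM algorithm has two steps. In the E-step, we determine the expected joint log
posterior density, averaging $z$ over its posterior distribution, given the last guessed value of
$\theta^{\mathrm{old}}$; that is, the expression $\mathrm{E}_{\mathrm{old}}$ refers to averaging $z$ over the distribution $p(z \mid \theta^{\mathrm{old}}, y)$. For
our hierarchical mixture model, the expected ‘complete-data’ log posterior density is

$$\begin{aligned}
\mathrm{E}_{\mathrm{old}}(\log p(z, \theta \mid y)) = {} & \text{constant} + \sum_{j=1}^{17} \log(\mathrm{N}(\alpha_j \mid \mu + \beta S_j, \sigma_\alpha^2)) \\
& + \sum_{j=1}^{17} \sum_{i=1}^{30} \left[ \log(\mathrm{N}(y_{ij} \mid \alpha_j, \sigma_y^2))(1 - \mathrm{E}_{\mathrm{old}}(z_{ij})) + \log(\mathrm{N}(y_{ij} \mid \alpha_j + \tau, \sigma_y^2))\,\mathrm{E}_{\mathrm{old}}(z_{ij}) \right] \\
& + \sum_{j=1}^{17} \sum_{i=1}^{30} S_j \left[ \log(1 - \lambda)(1 - \mathrm{E}_{\mathrm{old}}(z_{ij})) + \log(\lambda)\,\mathrm{E}_{\mathrm{old}}(z_{ij}) \right].
\end{aligned}$$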


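For the E-step, we must compute $\mathrm{E}_{\mathrm{old}}(z_{ij})$ for each observation $(i, j)$. Given $\theta^{\mathrm{old}}$ and $y$, the
indicators $z_{ij}$ are independent, with conditional posteriors,

$$\begin{aligned}
\Pr(z_{ij} = 0 \mid \theta^{\mathrm{old}}, y) &= 1 - \zeta_{ij} \\
\Pr(z_{ij} = 1 \mid \theta^{\mathrm{old}}, y) &= \zeta_{ij},
\end{aligned}$$

where

$$\zeta_{ij} = \frac{\lambda^{\mathrm{old}}\,\mathrm{N}(y_{ij} \mid \alpha_j^{\mathrm{old}} + \tau^{\mathrm{old}}, (\sigma_y^{\mathrm{old}})^2)}{(1 - \lambda^{\mathrm{old}})\,\mathrm{N}(y_{ij} \mid \alpha_j^{\mathrm{old}}, (\sigma_y^{\mathrm{old}})^2) + \lambda^{\mathrm{old}}\,\mathrm{N}(y_{ij} \mid \alpha_j^{\mathrm{old}} + \tau^{\mathrm{old}}, (\sigma_y^{\mathrm{old}})^2)}. \qquad (22.5)$$

For each $i, j$, the above expression is a function of $(y, \theta)$ and can be computed based on the
data $y$ and the current guess, $\theta^{\mathrm{old}}$.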
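
As a rough illustration of the E-step, the conditional probabilities in (22.5) can be computed as follows. The data layout (one list of log response times per person, plus a 0/1 list `S` marking schizophrenics) and the function names are assumptions made for this sketch.

```python
import math

def normal_pdf(y, mean, sd):
    u = (y - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))

def e_step(y, S, alpha, lam, tau, sigma_y):
    """Return zeta[j][i] = Pr(z_ij = 1 | theta_old, y), following (22.5).

    y[j] is the list of log response times for person j;
    S[j] is 1 for schizophrenics and 0 otherwise.
    """
    zeta = []
    for j, yj in enumerate(y):
        row = []
        for yij in yj:
            if S[j] == 0:
                row.append(0.0)  # non-schizophrenics have no delayed component
            else:
                delayed = lam * normal_pdf(yij, alpha[j] + tau, sigma_y)
                undelayed = (1.0 - lam) * normal_pdf(yij, alpha[j], sigma_y)
                row.append(delayed / (delayed + undelayed))
        zeta.append(row)
    return zeta

# Made-up check: one non-schizophrenic and one schizophrenic, two trials each
zeta = e_step([[5.7, 6.6], [5.7, 6.6]], [0, 1], [5.7, 5.7],
              lam=0.3, tau=0.9, sigma_y=0.2)
```

In the made-up check, the schizophrenic's first trial sits at the undelayed mean and gets a probability near zero, while the second sits at the delayed mean and gets a probability near one.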

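In the M-step, we must alter $\theta$ to increase $\mathrm{E}_{\mathrm{old}}(\log p(z, \theta \mid y))$. Using ECM, we alter one
set of components at a time, for each finding the conditional maximum given the others.
The conditional maximizing steps are easy:

1. Update $\lambda$ by computing the proportion of trials by schizophrenics exhibiting delayed
reaction times. We actually add up the possible fractional contributions of the 30 trials
for each schizophrenic, using the probabilities $\zeta_{ij}$ defined in (22.5):

$$\lambda^{\mathrm{new}} = \frac{1}{6 \cdot 30} \sum_{j=12}^{17} \sum_{i=1}^{30} \zeta_{ij}.$$

2. For each $j$, update $\alpha_j$ given the current values of the other parameters in $\theta$ by combining
the normal population distribution of $\alpha_j$ with the normal-mixture distribution for the
30 data points on person $j$:

$$\alpha_j^{\mathrm{new}} = \frac{\frac{1}{\sigma_\alpha^2}(\mu + \beta S_j) + \sum_{i=1}^{30} \frac{1}{\sigma_y^2}(y_{ij} - \zeta_{ij}\tau)}{\frac{1}{\sigma_\alpha^2} + \frac{30}{\sigma_y^2}}. \qquad (22.6)$$

3. Given the vector $\alpha$, the updated estimates for $\tau$ and $\sigma_y^2$ are obtained from the delayed
components of the schizophrenics’ reaction times:

$$\tau^{\mathrm{new}} = \frac{\sum_{j=12}^{17} \sum_{i=1}^{30} \zeta_{ij}(y_{ij} - \alpha_j)}{\sum_{j=12}^{17} \sum_{i=1}^{30} \zeta_{ij}}$$

$$(\sigma_y^{\mathrm{new}})^2 = \frac{1}{17 \cdot 30} \sum_{j=1}^{17} \sum_{i=1}^{30} (y_{ij} - \alpha_j - \zeta_{ij}\tau^{\mathrm{new}})^2.$$

4. Given the vector $\alpha$, the updated estimates for the population parameters $\mu$, $\beta$, and $\sigma_\alpha^2$
follow immediately from the normal population distribution (with uniform hyperprior
density); the conditional modes for ECM satisfy

$$\begin{aligned}
\mu^{\mathrm{new}} &= \frac{1}{17} \sum_{j=1}^{17} (\alpha_j - \beta^{\mathrm{new}} S_j) \\
\beta^{\mathrm{new}} &= \frac{1}{6} \sum_{j=12}^{17} (\alpha_j - \mu^{\mathrm{new}}) \\
(\sigma_\alpha^{\mathrm{new}})^2 &= \frac{1}{17} \sum_{j=1}^{17} (\alpha_j - \mu^{\mathrm{new}} - \beta^{\mathrm{new}} S_j)^2, \qquad (22.7)
\end{aligned}$$

which is equivalent to

$$\begin{aligned}
\mu^{\mathrm{new}} &= \frac{1}{11} \sum_{j=1}^{11} \alpha_j \\
\beta^{\mathrm{new}} &= \frac{1}{6} \sum_{j=12}^{17} \alpha_j - \frac{1}{11} \sum_{j=1}^{11} \alpha_j
\end{aligned}$$

and $(\sigma_\alpha^{\mathrm{new}})^2$ in (22.7).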
After 100 iterations of ECM from each of 100 starting points, we find three local maxima of
$(\alpha, \phi)$: a major mode and two minor modes. The minor modes are substantively uninteresting,
corresponding to near-degenerate models with the mixture parameter $\lambda$ near zero,
and have little support in the data, with posterior density ratios less than $e^{-20}$ with respect
to the major mode. We conclude that the minor modes can be ignored and, to the best of
our knowledge, the target distribution can be considered unimodal for practical purposes.
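
The four conditional maximizations of the M-step can be sketched as one function. This is an illustrative transcription of the update formulas as printed above, not a tested implementation of the original analysis; the dictionary keys, the data layout, and the toy numbers are assumptions. Here `zeta` holds the E-step probabilities from (22.5), with zeros for non-schizophrenics.

```python
import math

def cm_steps(y, S, zeta, theta):
    """One pass of conditional maximizations 1-4; returns an updated copy of theta."""
    J = len(y)
    schiz = [j for j in range(J) if S[j] == 1]
    non = [j for j in range(J) if S[j] == 0]
    t = dict(theta)
    # 1. lambda: average expected indicator over the schizophrenics' trials
    total_zeta = sum(sum(zeta[j]) for j in schiz)
    t["lam"] = total_zeta / sum(len(y[j]) for j in schiz)
    # 2. alpha_j: precision-weighted combination as in (22.6)
    pa, py = 1.0 / t["sigma_alpha"] ** 2, 1.0 / t["sigma_y"] ** 2
    t["alpha"] = [
        (pa * (t["mu"] + t["beta"] * S[j])
         + py * sum(yij - zij * t["tau"] for yij, zij in zip(y[j], zeta[j])))
        / (pa + len(y[j]) * py)
        for j in range(J)
    ]
    # 3. tau from the delayed components, then sigma_y from all residuals
    t["tau"] = sum(zij * (yij - t["alpha"][j])
                   for j in schiz for yij, zij in zip(y[j], zeta[j])) / total_zeta
    ss = sum((yij - t["alpha"][j] - zij * t["tau"]) ** 2
             for j in range(J) for yij, zij in zip(y[j], zeta[j]))
    t["sigma_y"] = math.sqrt(ss / sum(len(yj) for yj in y))
    # 4. population parameters, using the closed form equivalent to (22.7)
    t["mu"] = sum(t["alpha"][j] for j in non) / len(non)
    t["beta"] = sum(t["alpha"][j] for j in schiz) / len(schiz) - t["mu"]
    ss_a = sum((t["alpha"][j] - t["mu"] - t["beta"] * S[j]) ** 2 for j in range(J))
    t["sigma_alpha"] = math.sqrt(ss_a / J)
    return t

# Toy inputs (made up) small enough to check by hand
y = [[1.0, 1.0], [3.0, 3.0], [2.0, 4.0], [2.0, 4.0]]
S = [0, 0, 1, 1]
zeta = [[0.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
theta0 = {"alpha": [0.0] * 4, "mu": 0.0, "beta": 0.0, "tau": 0.0,
          "lam": 0.5, "sigma_y": 1.0, "sigma_alpha": 1e6}
theta1 = cm_steps(y, S, zeta, theta0)
```

Alternating an E-step with this function, and monitoring the marginal posterior density after each pass, gives the ECM iteration described in this section.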


Normal and t approximations at the major mode 


The marginal posterior distribution of the model parameters $\theta$, averaging over the indicators
$z_{ij}$, is an easily computed product of mixture forms:

$$\begin{aligned}
p(\theta \mid y) \propto {} & \prod_{j=1}^{17} \mathrm{N}(\alpha_j \mid \mu + \beta S_j, \sigma_\alpha^2) \\
& \times \prod_{j=1}^{17} \prod_{i=1}^{30} \left[ (1 - \lambda S_j)\,\mathrm{N}(y_{ij} \mid \alpha_j, \sigma_y^2) + \lambda S_j\,\mathrm{N}(y_{ij} \mid \alpha_j + \tau, \sigma_y^2) \right]. \qquad (22.8)
\end{aligned}$$

We compute this function while running the ECM algorithm to check that the marginal
posterior density indeed increases at each step. Once the modes have been found, we
construct a multivariate $t_4$ approximation for $\theta$, centered at the major mode with scale
determined by the numerically computed second derivative matrix at the mode.

We draw $S = 2000$ independent samples of $\theta$ from the $t_4$ distribution, as a starting
distribution for importance resampling of the parameter vector $\theta$.

Had we included samples from the neighborhoods of the minor modes up to this point, 
we would have found them to have minuscule importance weights. 
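
The marginal posterior density (22.8) is convenient to evaluate on the log scale, both for monitoring ECM and for computing importance weights. A minimal sketch follows, assuming the uniform hyperprior so that the log posterior equals the expression below up to an additive constant; the function names and data layout are hypothetical.

```python
import math

def log_normal_pdf(y, mean, sd):
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * ((y - mean) / sd) ** 2

def log_marginal_posterior(y, S, alpha, mu, beta, tau, lam, sigma_y, sigma_alpha):
    """Log of (22.8), up to an additive constant, averaging over the indicators."""
    total = 0.0
    for j, yj in enumerate(y):
        # population distribution of the person-level mean alpha_j
        total += log_normal_pdf(alpha[j], mu + beta * S[j], sigma_alpha)
        for yij in yj:
            undelayed = (1.0 - lam * S[j]) * math.exp(log_normal_pdf(yij, alpha[j], sigma_y))
            delayed = lam * S[j] * math.exp(log_normal_pdf(yij, alpha[j] + tau, sigma_y))
            # note: this would fail if both terms underflowed to zero
            total += math.log(undelayed + delayed)
    return total

# Made-up check with a single schizophrenic and two trials
lp_near = log_marginal_posterior([[0.0, 1.0]], [1], [0.0], 0.0, 0.0, 1.0, 0.5, 1.0, 1.0)
lp_far = log_marginal_posterior([[0.0, 1.0]], [1], [0.0], 0.0, 0.0, 5.0, 0.5, 1.0, 1.0)
```

In the made-up check, a delay of 1.0 matches the second observation and so yields a higher log density than a delay of 5.0.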


Simulation using the Gibbs sampler 


We drew a set of ten starting points by importance resampling from the $t_4$ approximation
centered at the major mode to create the starting distribution for the Gibbs sampler.
This distribution is intended to approximate our ideal starting conditions: for each scalar
estimand of interest, the mean is close to the target mean and the variance is greater than
the target variance.

The Gibbs sampler is easy to apply for our model because the full conditional posterior
distributions, $p(\phi \mid \alpha, z, y)$, $p(\alpha \mid \phi, z, y)$, and $p(z \mid \alpha, \phi, y)$, have standard forms and can be
easily sampled from. The required steps are analogous to the ECM steps used to find the
modes of the posterior distribution. Specifically, one complete cycle of the Gibbs sampler
requires the following sequence of simulations:

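1. For $z_{ij}$, $i = 1,\ldots,30$, $j = 12,\ldots,17$, draw $z_{ij}$ as independent Bernoulli($\zeta_{ij}$), with
probabilities $\zeta_{ij}$ defined in (22.5). The indicators $z_{ij}$ are fixed at 0 for the non-schizophrenics
($j < 12$).

2. For each individual $j$, draw $\alpha_j$ from a normal distribution with mean $\alpha_j^{\mathrm{new}}$, as defined
in (22.6), but with the factor $\zeta_{ij}$ in that expression replaced by $z_{ij}$, because we are now
conditioning on $z$ rather than averaging over it. The variance of the normal conditional
distribution for $\alpha_j$ is just the reciprocal of the denominator of (22.6).

3. Draw the mixture parameter $\lambda$ from a Beta($h + 1$, $180 - h + 1$) distribution, where
$h = \sum_{j=12}^{17} \sum_{i=1}^{30} z_{ij}$, the number of trials with attentional lapses out of 180 trials for
schizophrenics. The simulations are subject to the constraint that $\lambda$ is restricted to the
interval [0.001, 0.999].

4. For the remaining parameters, we proceed as in the normal distribution with unknown
mean and variance. Draws from the posterior distribution of $(\beta, \mu, \tau, \sigma_y^2, \sigma_\alpha^2)$ given
$(\alpha, \lambda, z)$ are obtained by first sampling from the marginal posterior distribution of the
variance parameters and then sampling from the conditional posterior distribution of the
others. First,

$$\sigma_y^2 \mid \alpha, \lambda, z \sim \text{Inv-}\chi^2\!\left(508,\ \frac{1}{508} \sum_{j=1}^{17} \sum_{i=1}^{30} (y_{ij} - \alpha_j - z_{ij}\tau)^2\right),$$

$$\sigma_\alpha^2 \mid \alpha, \lambda, z \sim \text{Inv-}\chi^2\!\left(15,\ \frac{1}{15} \sum_{j=1}^{17} (\alpha_j - \mu - \beta S_j)^2\right).$$

Then, conditional on the variances, $\tau$ can be simulated from a normal distribution,

$$\tau \mid \alpha, \lambda, z, \sigma_y^2, \sigma_\alpha^2 \sim \mathrm{N}\!\left(\frac{\sum_{j=12}^{17} \sum_{i=1}^{30} z_{ij}(y_{ij} - \alpha_j)}{\sum_{j=12}^{17} \sum_{i=1}^{30} z_{ij}},\ \frac{\sigma_y^2}{\sum_{j=12}^{17} \sum_{i=1}^{30} z_{ij}}\right). \qquad (22.9)$$

The conditional distribution of $\mu$ given all other parameters is normal and depends only
on $\alpha$, $\beta$, and $\sigma_\alpha^2$:

$$\mu \mid \alpha, \lambda, z, \beta, \sigma_y^2, \sigma_\alpha^2 \sim \mathrm{N}\!\left(\frac{1}{17} \sum_{j=1}^{17} (\alpha_j - \beta S_j),\ \frac{1}{17}\sigma_\alpha^2\right).$$

Finally, $\beta$ also has a normal conditional distribution:

$$\beta \mid \alpha, \lambda, z, \mu, \sigma_y^2, \sigma_\alpha^2 \sim \mathrm{N}\!\left(\frac{1}{6} \sum_{j=12}^{17} (\alpha_j - \mu),\ \frac{1}{6}\sigma_\alpha^2\right).$$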
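
One complete cycle of the sampler described above might be sketched as follows, using only the Python standard library. Several choices here are assumptions rather than details from the text: the state dictionary and its keys, the use of rejection to keep $\lambda$ inside [0.001, 0.999], degrees of freedom written as the number of observations minus 2 and the number of persons minus 2 (which give 508 and 15 for the data in this example), and raising an error when no trial is classified as delayed. The positivity restriction on $\tau$ is not enforced, and the toy data are simulated, not the experimental data.

```python
import math
import random

def normal_pdf(y, mean, sd):
    u = (y - mean) / sd
    return math.exp(-0.5 * u * u) / (sd * math.sqrt(2.0 * math.pi))

def inv_chi2(rng, df, scale):
    """Draw from the scaled Inv-chi^2(df, scale) as df * scale / chi^2_df."""
    return df * scale / rng.gammavariate(df / 2.0, 2.0)

def gibbs_cycle(y, S, st, rng):
    """One cycle of steps 1-4; st holds alpha (list), mu, beta, tau, lam, s2y, s2a."""
    J = len(y)
    schiz = [j for j in range(J) if S[j] == 1]
    n_all = sum(len(yj) for yj in y)
    n_s = sum(len(y[j]) for j in schiz)
    sy = math.sqrt(st["s2y"])
    # Step 1: z_ij ~ Bernoulli(zeta_ij); the probability is 0 when S_j = 0
    z = []
    for j in range(J):
        row = []
        for yij in y[j]:
            d = st["lam"] * S[j] * normal_pdf(yij, st["alpha"][j] + st["tau"], sy)
            u = (1.0 - st["lam"] * S[j]) * normal_pdf(yij, st["alpha"][j], sy)
            row.append(1 if rng.random() < d / (d + u) else 0)
        z.append(row)
    # Step 2: alpha_j from its normal conditional, (22.6) with z in place of zeta
    pa, py = 1.0 / st["s2a"], 1.0 / st["s2y"]
    alpha = []
    for j in range(J):
        prec = pa + len(y[j]) * py
        resid = sum(yij - zij * st["tau"] for yij, zij in zip(y[j], z[j]))
        mean = (pa * (st["mu"] + st["beta"] * S[j]) + py * resid) / prec
        alpha.append(rng.gauss(mean, math.sqrt(1.0 / prec)))
    # Step 3: lambda ~ Beta(h + 1, n_s - h + 1), redrawn until inside [0.001, 0.999]
    h = sum(sum(z[j]) for j in schiz)
    if h == 0:
        raise ValueError("no delayed trials: the conditional for tau is undefined")
    lam = rng.betavariate(h + 1, n_s - h + 1)
    while not 0.001 <= lam <= 0.999:
        lam = rng.betavariate(h + 1, n_s - h + 1)
    # Step 4: variances first, then tau, mu, beta
    ss_y = sum((yij - alpha[j] - zij * st["tau"]) ** 2
               for j in range(J) for yij, zij in zip(y[j], z[j]))
    s2y = inv_chi2(rng, n_all - 2, ss_y / (n_all - 2))
    ss_a = sum((alpha[j] - st["mu"] - st["beta"] * S[j]) ** 2 for j in range(J))
    s2a = inv_chi2(rng, J - 2, ss_a / (J - 2))
    tau_mean = sum(zij * (yij - alpha[j])
                   for j in schiz for yij, zij in zip(y[j], z[j])) / h
    tau = rng.gauss(tau_mean, math.sqrt(s2y / h))
    mu = rng.gauss(sum(alpha[j] - st["beta"] * S[j] for j in range(J)) / J,
                   math.sqrt(s2a / J))
    beta = rng.gauss(sum(alpha[j] - mu for j in schiz) / len(schiz),
                     math.sqrt(s2a / len(schiz)))
    return {"alpha": alpha, "mu": mu, "beta": beta, "tau": tau, "lam": lam,
            "s2y": s2y, "s2a": s2a, "z": z}

# Simulated toy data (made up): 3 non-schizophrenics, 3 schizophrenics, 30 trials each
rng = random.Random(1)
S = [0, 0, 0, 1, 1, 1]
true_alpha = [5.6, 5.7, 5.8, 6.0, 6.1, 6.2]
y = [[rng.gauss(a + (0.8 if s == 1 and rng.random() < 0.3 else 0.0), 0.2)
      for _ in range(30)] for a, s in zip(true_alpha, S)]
state = {"alpha": [sum(yj) / len(yj) for yj in y], "mu": 5.7, "beta": 0.3,
         "tau": 0.8, "lam": 0.3, "s2y": 0.04, "s2a": 0.04}
for _ in range(20):
    state = gibbs_cycle(y, S, state, rng)
```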


As is common in conjugate models, the steps of the Gibbs sampler are simply stochastic
versions of the ECM steps; for example, variances are drawn from the relevant scaled inverse-$\chi^2$
distributions rather than being set to the posterior mode, and the centers of most of the
distributions use the ECM formulas for conditional modes with $\zeta_{ij}$’s replaced by $z_{ij}$’s.

              Old model                    New model
Parameter     2.5%    median   97.5%      2.5%    median   97.5%
$\lambda$     0.07    0.12     0.18       0.46    0.64     0.88
$\tau$        0.74    0.85     0.96       0.21    0.42     0.60
$\beta$       0.17    0.32     0.48       0.07    0.24     0.43
$\omega$      fixed at 1                  0.24    0.56     0.84


Table 22.1 Posterior quantiles for parameters of interest under the old and new mixture models
for the reaction time experiment. Introducing the new mixture parameter $\omega$, which represents
the proportion of schizophrenics with attentional delays, changes the interpretation of the other
parameters in the model.


Possible difficulties at a degenerate point 


If all the $z_{ij}$’s are zero, then the mean and variance of the conditional distribution (22.9) are
undefined, because $\tau$ has an improper prior distribution and, conditional on $\sum_{ij} z_{ij} = 0$,
there are no delayed reactions and thus no information about $\tau$. Strictly speaking, this
means that our posterior distribution is improper. For the data at hand, however, this
degenerate point has extremely low posterior probability and is not reached by any of our
simulations. If the data were such that $\sum_{ij} z_{ij} = 0$ were a realistic possibility, it would be
necessary to assign an informative prior distribution for $\tau$.


Inference from the iterative simulations 


In the mixture model example, we computed several univariate estimands: the 17 random
effects $\alpha_j$ and their standard deviation $\sigma_\alpha$, the shift parameters $\tau$ and $\beta$, the standard deviation
of observations $\sigma_y$, the mixture parameter $\lambda$, the ratio of standard deviations $\sigma_\alpha/\sigma_y$,
and the log posterior density. After an initial run of ten sequences for 200 simulations, we
computed the estimated potential scale reductions, $\widehat{R}$, for all scalar estimands, and found
them all to be below 1.1; we were thus satisfied with the simulations (see the discussion
in Section 11.4). The potential scale reductions were estimated on the logarithmic scale
for the variance parameters and the logit scale for $\lambda$. We obtain posterior intervals for all
quantities of interest from the quantiles of the 1000 simulations from the second halves of
the sequences.
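
As a rough guide to the convergence check, the potential scale reduction for one scalar estimand can be computed from the retained halves of the sequences. The sketch below assumes the usual between-sequence and within-sequence variance formula; it does not further split each sequence, as some versions of the diagnostic do, and any transformation (logarithm for variances, logit for $\lambda$) would be applied to the draws before calling it. The function names are hypothetical.

```python
import math

def second_halves(chains):
    """Discard the first half of every sequence."""
    return [c[len(c) // 2:] for c in chains]

def potential_scale_reduction(chains):
    """Estimated potential scale reduction from m sequences of equal length n."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # between-sequence variance B and within-sequence variance W
    B = n / (m - 1) * sum((mj - grand) ** 2 for mj in means)
    W = sum(sum((x - mj) ** 2 for x in c) / (n - 1)
            for c, mj in zip(chains, means)) / m
    var_plus = (n - 1) / n * W + B / n  # pooled estimate of the marginal variance
    return math.sqrt(var_plus / W)

# Made-up sequences: two that disagree strongly and two that coincide
r_bad = potential_scale_reduction([[0, 1, 0, 1], [10, 11, 10, 11]])
r_same = potential_scale_reduction([[0, 1, 0, 1], [0, 1, 0, 1]])
```

Sequences that have not mixed give a value far above 1, while sequences that agree give a value near (here slightly below) 1.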

The first three columns of Table 22.1 display posterior medians and 95% intervals from
the Gibbs sampler simulations for the parameters of most interest to the psychologists:

• $\lambda$, the probability that an observation will be delayed for a schizophrenic subject to
attentional delays;

• $\tau$, the attentional delay (on the log scale);

• $\beta$, the average log response time for the undelayed observations of schizophrenics minus
the average log response time for non-schizophrenics.

For now, ignore the final row and the final three columns of Table 22.1. Under this model,
there is strong evidence that the average reaction times are slower for schizophrenics (the
factor of $\exp(\beta)$, which has a 95% posterior interval of [1.18, 1.62]), with a fairly infrequent
(probability in the range [0.07, 0.18]) but large attentional delay ($\exp(\tau)$ in the range
[2.10, 2.61]).


Posterior predictive distributions 


We obtain draws from the posterior predictive distribution by using the draws from the
posterior distribution of the model parameters $\theta$ and mixture components $z$. Two kinds of
predictions are possible:

1. For additional measurements on a person $j$ in the experiment, posterior simulations $\tilde{y}_{ij}$
should be made conditional on the individual parameter $\alpha_j$ from the posterior simulations.

2. For measurements on an entirely new person $j$, one should first draw a new parameter
$\tilde{\alpha}_j$ conditional on the hyperparameters, $\mu, \beta, \sigma_\alpha^2$. Simulations of measurements $\tilde{y}_{ij}$ can
then be performed conditional on $\tilde{\alpha}_j$. That is, for each simulated parameter vector $\theta^s$,
draw $\tilde{\alpha}^s$, then $\tilde{y}^s$.


The different predictions are useful for different purposes. For checking the fit of the model, 
we use the first kind of prediction so as to compare the observed data with the posterior 
predictive distribution of the measurements on those particular individuals. 


Checking the model 


Defining test quantities to assess aspects of poor fit. The model was chosen to fit accurately
the unequal means and variances in the two groups of people in the study, but there was still
some question about the fit to individuals. In particular, the histograms of schizophrenics’
reaction times (Figure 22.1) indicate that there is substantial variation in the within-person
response time variance. To investigate whether the model can explain this feature of the
data, we compute $s_j$, the standard deviation of the 30 log reaction times $y_{ij}$, for each
schizophrenic individual $j = 12,\ldots,17$. We then define two test quantities: $T_{\min}$ and
$T_{\max}$, the smallest and largest of the six values $s_j$. To obtain the posterior predictive
distribution of the two test quantities, we simulate predictive datasets from the normal-mixture
model for each of the 1000 simulation draws of the parameters from the posterior
distribution. For each of those 1000 simulated datasets, $y^{\mathrm{rep}}$, we compute the two test
quantities, $T_{\min}(y^{\mathrm{rep}})$, $T_{\max}(y^{\mathrm{rep}})$.


Graphical display of realized test quantities in comparison to their posterior predictive dis- 
tribution. In general, we can look at test quantities individually by plotting histograms 
of their posterior predictive distributions, with the observed value marked. In this case, 
however, with exactly two test quantities, it is natural to look at a scatterplot of the joint 
distribution. Figure 22.2 displays a scatterplot of the 1000 simulated values of the test 
quantities, with the observed values indicated by an x. With regard to these test quantities,
the observed data $y$ are atypical of the posterior predictive distribution ($T_{\min}$ is too
low and $T_{\max}$ is too high), with estimated p-values of 0.000 and 1.000 (to three decimal
places).
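
The check on the smallest and largest within-person standard deviations can be sketched as follows. This is a hedged illustration, not the original computation: the layout of the posterior draws (a list of dictionaries), the function names, and the made-up "observed" data are assumptions, and the p-value is taken to be the proportion of replications whose test quantity is at least as large as the observed one.

```python
import random
import statistics

def simulate_person(rng, alpha_j, tau, lam, sigma_y, n_trials):
    """Replicated log response times for one schizophrenic under the mixture model."""
    return [rng.gauss(alpha_j + (tau if rng.random() < lam else 0.0), sigma_y)
            for _ in range(n_trials)]

def min_max_sd(y_schiz):
    """Smallest and largest within-person standard deviations."""
    s = [statistics.stdev(yj) for yj in y_schiz]
    return min(s), max(s)

def predictive_p_values(y_schiz, draws, rng):
    """Estimate Pr(T(y_rep) >= T(y) | y) for the smallest and largest sd.

    draws: posterior draws, each a dict with keys alpha (one value per
    schizophrenic), tau, lam, sigma_y.
    """
    t_min_obs, t_max_obs = min_max_sd(y_schiz)
    n_min = n_max = 0
    for d in draws:
        y_rep = [simulate_person(rng, a, d["tau"], d["lam"], d["sigma_y"], len(yj))
                 for a, yj in zip(d["alpha"], y_schiz)]
        t_min, t_max = min_max_sd(y_rep)
        n_min += t_min >= t_min_obs
        n_max += t_max >= t_max_obs
    return n_min / len(draws), n_max / len(draws)

# Made-up extreme data: one nearly constant person and one wildly variable person
y_obs = [[0.0, 0.001] * 15, [0.0, 100.0] * 15]
draws = [{"alpha": [0.0, 0.0], "tau": 1.0, "lam": 0.3, "sigma_y": 0.2}] * 50
p_min, p_max = predictive_p_values(y_obs, draws, random.Random(0))
```

Because the replications condition on each person's own $\alpha_j$, this corresponds to the first kind of prediction described in the previous subsection.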


Example of a poor test quantity. In contrast, a test quantity such as the average value of
$s_j$ is not useful for model checking since this is essentially the sufficient statistic for the
model parameter $\sigma_y$ and thus the model will automatically fit it well.


Expanding the model 


An attempt was made to fit the data more accurately by adding two further parameters to 
the model, one parameter to allow some schizophrenics to be unaffected by attentional de- 
lays, and a second parameter that allows the delayed observations to be more variable than 
undelayed observations. We add the parameter $\omega$ as the probability that a schizophrenic
individual has attentional delays and the parameter $\sigma_{y2}^2$ as the variance of attention-delayed
measurements. In the expanded model, we give both these parameters uniform prior distributions.
The model we have previously fitted can be viewed as a special case of the new
model, with $\omega = 1$ and $\sigma_{y2}^2 = \sigma_y^2$.

For computational purposes, we also introduce another indicator variable, $W_j$, that is 1 if
individual $j$ is prone to attention delays and 0 otherwise. The indicator $W_j$ is automatically
0 for non-schizophrenics and is 1 with probability $\omega$ for each schizophrenic. Both of these
parameters are appended to the parameter vector $\theta$ to yield the model,

$$\begin{aligned}
y_{ij} \mid z_{ij}, \theta &\sim \mathrm{N}(\alpha_j + \tau z_{ij}, (1 - z_{ij})\sigma_y^2 + z_{ij}\sigma_{y2}^2) \\
\alpha_j \mid z, S, \mu, \beta, \sigma_\alpha^2 &\sim \mathrm{N}(\mu + \beta S_j, \sigma_\alpha^2) \\
z_{ij} \mid S, W, \theta &\sim \mathrm{Bernoulli}(\lambda S_j W_j) \\
W_j \mid S, \theta &\sim \mathrm{Bernoulli}(\omega S_j).
\end{aligned}$$

Figure 22.2 Scatterplot of the posterior predictive distribution of two test quantities: the smallest
and largest observed within-schizophrenic variances. The x represents the observed value of the
test quantity in the dataset.

It is straightforward to fit the new model by just adding three new steps in the Gibbs
sampler to update $\omega$, $\sigma_{y2}^2$, and $W$. In addition, the Gibbs sampler steps for the old parameters
must be altered somewhat to be conditional on the new parameters. We do not give
the details here but just present the results. We use ten randomly selected draws from the 
previous posterior simulation as starting points for ten parallel runs of the Gibbs sampler 
(values of the new parameters are drawn as the first Gibbs steps). Ten simulated sequences 
each of length 500 were sufficient for approximate convergence, with estimated potential 
scale reductions less than 1.1 for all model parameters. As usual, we discarded the first half 
of each sequence, leaving a set of 2500 draws from the posterior distribution of the larger 
model. 

Before performing posterior predictive checks, it makes sense to compare the old and 
new models in their posterior distributions for the parameters. The last three columns of 
Table 22.1 display inferences for the parameters of applied interest under the new model and 
show significant differences from the old model. Under the new model, a greater proportion 
of schizophrenic observations is delayed, but the average delay is shorter. We have also 
included a row in the table for w, the probability that a schizophrenic will be subject to 
attentional delays, which was fixed at 1 in the old model. Since the old model is nested 
within the new model, the differences between the inferences suggest a real improvement in 
fit. 


Checking the new model 


The expanded model is an improvement, but how well does it fit the data? We expect that
the new model should show an improved fit with respect to the test quantities considered
in Figure 22.2, since the new parameters describe an additional source of person-to-person
variation. (The new parameters have substantive interpretations in psychology and are not
merely ‘curve fitting.’) We check the fit of the expanded model using posterior predictive
simulation of the same test quantities under the new posterior distribution. The results are
displayed in Figure 22.3, based on posterior predictive simulations from the new model.

Figure 22.3 Scatterplot of the posterior predictive distribution, under the expanded model, of two test
quantities: the smallest and largest within-schizophrenic variance. The x represents the observed
value of the test quantity in the dataset.

Once again, the x indicates the observed test quantity. Compared to Figure 22.2, the 
x is in the same place, but the posterior predictive distribution has moved closer to it. 
(The two figures are on different scales.) The fit of the new model, however, is by no 
means perfect: the x is still in the periphery, and the estimated p-values of the two test 
quantities are 0.97 and 0.05. Perhaps most important, the lack of fit has greatly diminished 
in magnitude, as can be seen by examining the scales of Figures 22.2 and 22.3. We are left 
with an improved model that still shows some lack of fit, suggesting possible directions for 
improved modeling and data collection. 


22.3 Label switching and posterior computation 


Due to identifiability issues, such as the so-called label ambiguity and label switching 
problems (already mentioned briefly in Section 4.3), it makes an important difference in 
conducting the analysis and defining priors whether there is interest in inferences on mixture 
component-specific parameters and clustering. The label ambiguity problem refers to the 
issue in which there is nothing in the likelihood to distinguish mixture component $h$ as 
different from $h'$. For example, if we have a two-component mixture model with $\pi_1 = 0.2$, 
$\theta_1 = 0$, $\pi_2 = 0.8$, and $\theta_2 = -1$, then this model is equivalent to the model with $\pi_1 = 0.8$, 
$\theta_1 = -1$, $\pi_2 = 0.2$, and $\theta_2 = 0$. In general, one can permute the labels on the different mixture 
components without having any impact on the likelihood. This issue becomes apparent 
when using the EM algorithm to maximize the likelihood for a finite mixture model. If the 
algorithm converges to a point estimate $(\hat\pi_1,\ldots,\hat\pi_H)$, $(\hat\theta_1,\ldots,\hat\theta_H)$, $\hat\phi$, there will be another 
estimate with an equivalent likelihood corresponding to $(\hat\pi_{\kappa_1},\ldots,\hat\pi_{\kappa_H})$, 
$(\hat\theta_{\kappa_1},\ldots,\hat\theta_{\kappa_H})$, $\hat\phi$, where $(\kappa_1,\ldots,\kappa_H)$ is any permutation of the indexes $\{1,\ldots,H\}$. 
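
The label ambiguity can be checked numerically. The following is a minimal sketch, assuming a normal kernel with unit standard deviation (the text's two-component example does not specify the kernel) and a small made-up dataset; the function names are hypothetical. It evaluates the mixture log-likelihood under both labelings of the example and shows that they coincide.

```python
import math
from itertools import permutations


def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) evaluated at y."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def mixture_loglik(y, weights, means, sd=1.0):
    """Log-likelihood of data y under a finite mixture of normals with common sd."""
    return sum(
        math.log(sum(w * normal_pdf(yi, m, sd) for w, m in zip(weights, means)))
        for yi in y
    )


# Two-component example from the text: (pi, theta) = (0.2, 0) and (0.8, -1).
y = [-1.3, -0.9, 0.1, -1.1, 0.4]   # made-up data, for illustration only
weights, means = [0.2, 0.8], [0.0, -1.0]

# Evaluate the log-likelihood under every permutation of the component labels.
logliks = [
    mixture_loglik(y, [weights[k] for k in perm], [means[k] for k in perm])
    for perm in permutations(range(2))
]
```

Both entries of `logliks` agree, so the data alone cannot say which component should be called component 1.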
A Bayesian model requires a joint prior distribution for $(\pi_1,\ldots,\pi_H)$, $(\theta_1,\ldots,\theta_H)$, $\phi$. 
If the mixture components $\{1,\ldots,H\}$ are exchangeable in the prior distribution, then the 
marginal posterior distribution of $\theta_h$ will be identical for all $h$. Hence, one cannot 
meaningfully estimate a posterior distribution for mixture component (subpopulation) $h$ without 
somehow defining component $h$ as different from the others. A typical exchangeable prior 
would let 
$$(\pi_1,\ldots,\pi_H) \sim \mathrm{Dirichlet}(a,\ldots,a) \quad\text{and}\quad \theta_h \sim P_0, \tag{22.10}$$ 
independently, with $P_0$ an arbitrary common prior from which the mixture component-specific 
parameters are drawn. We initially treat the number of mixture components $H$ as 
known, but in the next section we address the issue of estimating and allowing uncertainty 
in $H$. A Bayesian specification is completed with a prior for $\phi$. 

Based on prior (22.10), we can define a simple data augmentation algorithm for posterior 
computation. For simplicity in illustration, focus on the case in which we have a univariate 
location-scale mixture of Gaussians with 

$$y_i \mid z_i \sim \mathrm{N}(\mu_{z_i}, \tau_{z_i}^2), \quad \Pr(z_i = h) = \pi_h, \quad i = 1,\ldots,n, \tag{22.11}$$ 

with the prior specified as in (22.10) with $P_0$ chosen as conditionally conjugate with 

$$\mu_h, \tau_h^2 \sim \mathrm{N}(\mu_h \mid \mu_0, \kappa\tau_h^2)\,\text{Inv-gamma}(\tau_h^2 \mid a_\tau, b_\tau), \quad h = 1,\ldots,H. \tag{22.12}$$ 

In this case, we can define a Gibbs sampling algorithm that alternates between imputing 
the subpopulation indexes $\{z_i\}$ for each item, updating the parameters specific to each 
subpopulation from the conjugate normal-inverse-gamma conditional posterior distribution, and 
updating the mixture component probabilities from the Dirichlet conditional posterior. 
There are no unknown global parameters $\phi$. The specific steps are as follows: 


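1. Update $z_i$ from its multinomial conditional posterior with 

$$\Pr(z_i = h \mid -) = \frac{\pi_h \mathrm{N}(y_i \mid \mu_h, \tau_h^2)}{\sum_{l=1}^H \pi_l \mathrm{N}(y_i \mid \mu_l, \tau_l^2)}, \quad h = 1,\ldots,H.$$ 

2. Update $(\mu_h, \tau_h^2)$ from its conditional posterior distribution, 

$$p(\mu_h, \tau_h^2 \mid -) = \mathrm{N}(\mu_h \mid \hat\mu_h, \hat\kappa_h\tau_h^2)\,\text{Inv-gamma}(\tau_h^2 \mid \hat a_{\tau_h}, \hat b_{\tau_h}),$$ 
$$\hat\kappa_h = (\kappa^{-1} + n_h)^{-1}, \quad \hat\mu_h = \hat\kappa_h(\kappa^{-1}\mu_0 + n_h\bar y_h), \quad \hat a_{\tau_h} = a_\tau + n_h/2,$$ 
$$\hat b_{\tau_h} = b_\tau + \frac{1}{2}\left(\sum_{i:z_i=h}(y_i - \bar y_h)^2 + \frac{n_h}{1+\kappa n_h}(\bar y_h - \mu_0)^2\right),$$ 

where $n_h = \sum_i 1_{z_i=h}$ denotes the number of items allocated to component $h$ in the 
current iteration, and $\bar y_h = \frac{1}{n_h}\sum_{i:z_i=h} y_i$ is the mean within component $h$. 

3. Update $(\pi_1,\ldots,\pi_H) \mid - \sim \mathrm{Dirichlet}(a + n_1,\ldots,a + n_H)$. 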
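
The three steps can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than a tuned implementation: the function name `gibbs_mixture`, the starting values, the underflow guard, and the simulated data are all assumptions made for the example, and the default hyperparameters are simply plausible values for standardized data.

```python
import math
import random


def normal_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gibbs_mixture(y, H, n_iter, a, mu0=0.0, kappa=1.0, a_tau=2.0, b_tau=4.0, seed=0):
    """Gibbs sampler for the univariate location-scale mixture (22.11)-(22.12)."""
    rng = random.Random(seed)
    pi = [1.0 / H] * H
    mu = [rng.choice(y) for _ in range(H)]   # crude starting values (an assumption)
    tau2 = [1.0] * H
    draws = []
    for _ in range(n_iter):
        # Step 1: impute the component index z_i for each item.
        z = []
        for yi in y:
            probs = [pi[h] * normal_pdf(yi, mu[h], tau2[h]) for h in range(H)]
            if sum(probs) <= 0.0:            # guard against numerical underflow
                probs = [1.0] * H
            z.append(rng.choices(range(H), weights=probs)[0])
        # Step 2: update (mu_h, tau_h^2) from the normal-inverse-gamma conditional.
        counts = []
        for h in range(H):
            yh = [yi for yi, zi in zip(y, z) if zi == h]
            nh = len(yh)
            counts.append(nh)
            ybar = sum(yh) / nh if nh > 0 else mu0
            ss = sum((yi - ybar) ** 2 for yi in yh)
            kappa_hat = 1.0 / (1.0 / kappa + nh)
            mu_hat = kappa_hat * (mu0 / kappa + nh * ybar)
            a_hat = a_tau + nh / 2.0
            b_hat = b_tau + 0.5 * (ss + nh / (1.0 + kappa * nh) * (ybar - mu0) ** 2)
            # Inv-gamma(a, b) draw: reciprocal of a Gamma(shape=a, rate=b) draw.
            tau2[h] = 1.0 / rng.gammavariate(a_hat, 1.0 / b_hat)
            mu[h] = rng.gauss(mu_hat, math.sqrt(kappa_hat * tau2[h]))
        # Step 3: update the weights from Dirichlet(a + n_1, ..., a + n_H),
        # drawn here by normalizing independent gamma variates.
        lam = [rng.gammavariate(a + counts[h], 1.0) for h in range(H)]
        total = sum(lam)
        pi = [l / total for l in lam]
        draws.append({"pi": list(pi), "mu": list(mu), "tau2": list(tau2), "z": list(z)})
    return draws


# Simulated two-cluster data, for illustration only.
data_rng = random.Random(1)
y = [data_rng.gauss(-1.0, 0.3) for _ in range(30)] + [data_rng.gauss(1.0, 0.3) for _ in range(30)]
draws = gibbs_mixture(y, H=5, n_iter=200, a=1.0 / 5)
```

Each element of `draws` stores one saved iteration; in practice one would discard an initial warm-up portion before summarizing.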


Each of these steps is simple to implement. After running to approximate convergence, we 
can examine the density, $g(y) = \sum_{h=1}^H \pi_h \mathrm{N}(y \mid \mu_h, \tau_h^2)$, at a dense grid of $y$ values across 
many simulation draws, and estimate the density using $\frac{1}{S}\sum_{s=1}^S \sum_{h=1}^H \pi_h^{(s)} \mathrm{N}(y \mid \mu_h^{(s)}, \tau_h^{2(s)})$, 
with the superscripts denoting the $s$-th saved simulation draw. 
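
As a small illustration of this averaging, the sketch below evaluates the mixture density on a grid for each saved draw and averages over draws. The dictionary layout of a draw and the two hand-made draws are assumptions for the example; the draws are chosen to differ only by a relabeling of the components.

```python
import math


def normal_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def density_estimate(grid, draws):
    """Average the mixture density over saved draws at each grid point."""
    S = len(draws)
    return [
        sum(
            sum(p * normal_pdf(y, m, v) for p, m, v in zip(d["pi"], d["mu"], d["tau2"]))
            for d in draws
        ) / S
        for y in grid
    ]


# Two hand-made draws that differ only by a relabeling of the components.
draws = [
    {"pi": [0.2, 0.8], "mu": [0.0, -1.0], "tau2": [1.0, 1.0]},
    {"pi": [0.8, 0.2], "mu": [-1.0, 0.0], "tau2": [1.0, 1.0]},
]
grid = [-3.0 + 0.1 * i for i in range(61)]
g_hat = density_estimate(grid, draws)
```

Because the density sums over components, relabeled draws contribute identical curves, so `g_hat` here equals the density computed from either draw alone.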

Alternatively, one can assess convergence for the mixture component-specific parameters 
$(\mu_h, \tau_h^2)$. However, we recommend avoiding this due to label switching issues that lead to a 
type of poor mixing in the mixture component-specific parameters, which may not impact 
convergence and mixing of the induced density $g(y)$. In particular, due to exchangeability 
of the mixture components, we know in advance that the marginal posterior distribution 
of $(\mu_h, \tau_h^2)$ is identical for all $h \in \{1,\ldots,H\}$ and hence the chains for each of the mixture 
component-specific parameters have the same target distribution. Consider the example 
we discussed above in which one mixture component is located at $\mu = 0$, one component is 
located at $\mu = -1$, and we fix $H = 2$. The Gibbs samples for $\mu_1$ should then randomly jump 
between values close to 0 and values close to $-1$ if mixing is good. The posterior for $\mu_h$ 
has a multimodal form with one mode close to 0 and one mode close to $-1$. If these modes 
are well separated and there is a region of low probability density between the modes, as 
is commonly the case in applications, then the Gibbs sampler will remain stuck for long 
intervals in one mode. For example, for the first 5000 iterations the $\mu_1$ samples may be 
close to 0 and the $\mu_2$ samples close to $-1$, and then the labels may suddenly switch, with the 
$\mu_1$ samples then close to $-1$ and the $\mu_2$ samples close to 0. A well-mixing Gibbs sampler 
should switch between these modes often, but in practice for well separated components, it 
is common to remain stuck in one labeling across all the samples that are collected. One 
can argue that the Gibbs sampler has failed in such a case, but the practical reality is 
that samples of unknowns based on marginalizing out the mixture component labels, such 
as the density $g$, often exhibit good rates of convergence and mixing even if the mixture 
component labels do not. 

In our experience based on conducting a wide variety of simulation studies and real 
data analyses, the Bayes density estimate using the above finite mixture model and Gibbs 
sampler for sufficiently large H and a=1/H (to be justified below) has excellent frequency 
properties with lower mean integrated squared error than typical kernel density estimators. 
In allowing the normal kernel bandwidth to differ across mixture components, the approach 
automatically allows a locally adaptive degree of smoothness in the density. The intrinsic 
Bayes penalty for model complexity allows the local bandwidths to be automatically chosen, 
while allowing uncertainty in their estimation, without relying on cross-validation. The 
penalty for model complexity arises because the marginal likelihood will tend to decrease 
as items are allocated to increasing numbers of mixture components. For each new mixture 
component occupied, additional parameters are introduced in the likelihood and then one 
has to marginalize across a larger space. 

It is important to avoid choosing a $P_0$ that is improper, or even diffuse but proper, as 
the results may be sensitive to the size of the enormous variance chosen. Instead, best 
results are obtained when $P_0$ is chosen to generate mixture components that are close to 
the support of the data. There are several ways to accomplish this in practice. Firstly, 
one can normalize the data in advance of the analysis to facilitate elicitation of $P_0$. For 
example, in the normal location-scale mixture described above, after normalization we can 
let $\mu_0 = 0$, $\kappa = 1$, $a_\tau = 2$, and $b_\tau = 4$ as a reasonable default. After obtaining samples from 
the posterior for the standardized density, one can easily apply a linear transformation to 
each sample to obtain samples from the unnormalized density. An alternative is to elicit 
the hyperparameters in $P_0$ based on one’s prior knowledge about the location and scale of 
the data. 
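
The normalization and the back-transformation can be sketched as follows. If the data are standardized as $u_i = (y_i - m)/s$, a draw of the mixture parameters on the standardized scale maps back through $\mu_h \to m + s\mu_h$ and $\tau_h^2 \to s^2\tau_h^2$, with the weights unchanged. The function names and the example numbers below are hypothetical.

```python
import math


def standardize(y):
    """Return standardized data together with the sample mean and standard deviation."""
    n = len(y)
    m = sum(y) / n
    s = math.sqrt(sum((yi - m) ** 2 for yi in y) / (n - 1))
    return [(yi - m) / s for yi in y], m, s


def unstandardize_draw(draw, m, s):
    """Map one draw of mixture parameters back to the original data scale."""
    return {
        "pi": list(draw["pi"]),                       # weights are unchanged
        "mu": [m + s * mu for mu in draw["mu"]],      # locations shift and scale
        "tau2": [s ** 2 * t for t in draw["tau2"]],   # variances scale by s^2
    }


y = [9.2, 10.1, 19.5, 20.3, 21.0, 33.0]   # made-up measurements
u, m, s = standardize(y)
draw_std = {"pi": [0.5, 0.5], "mu": [-1.0, 1.0], "tau2": [0.25, 0.25]}
draw_orig = unstandardize_draw(draw_std, m, s)
```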

Now suppose that one wishes to do inferences on the mixture component-specific 
parameters $(\mu_h, \tau_h^2)$. For the reasons discussed above related to label switching, it is not 
appropriate to simply calculate posterior summaries based on posterior draws of $(\mu_h, \tau_h^2)$. 
For example, in the two-component illustration, one would obtain the same posterior mean 
for $\mu_1$ and $\mu_2$ if the Gibbs sampler mixed well enough and sufficiently many samples were 
drawn. If we could somehow relabel the samples so that after relabeling all the samples of 
$(\mu_1, \tau_1^2)$ correspond to the component at $\mu = 0$ and all the samples of $(\mu_2, \tau_2^2)$ correspond 
to the component at $\mu = -1$, then we could calculate posterior summaries of the mixture 
component-specific parameters in the standard manner. A number of postprocessing 
algorithms have been proposed; one common idea is as follows. Run the MCMC algorithm 
ignoring the label-switching problem, obtaining samples $s = 1,\ldots,S$. Letting $\sigma_s$ denote 
a permutation of $\{1,\ldots,H\}$ for sample $s$, choose $\sigma_1,\ldots,\sigma_S$ (permute the initial labels) 
to minimize a loss function $L(a,\theta)$, with $a$ an action and $\theta$ the parameters. This can be 
accomplished by iterating between choosing $a$ to minimize the expected posterior loss 
conditionally on $\{\sigma_s\}$ and choosing each $\sigma_s$ to minimize the loss conditionally on the current 
$a$. 
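
A minimal sketch of this two-step iteration is given below. It assumes a squared-error loss on the component means, chosen here only because the minimizing action is then a simple average; the text leaves the loss general. The function name `relabel` and the made-up draws are hypothetical, and the brute-force search over permutations is only practical for small $H$.

```python
from itertools import permutations


def relabel(mu_draws, n_sweeps=10):
    """Relabel draws of component means under squared-error loss (an assumed loss)."""
    S, H = len(mu_draws), len(mu_draws[0])
    perms = [tuple(range(H))] * S            # start from the labels as sampled
    for _ in range(n_sweeps):
        # Given the permutations, the action minimizing the average squared-error
        # loss is the componentwise mean of the relabeled draws.
        action = [sum(mu_draws[s][perms[s][h]] for s in range(S)) / S for h in range(H)]
        # Given the action, choose each draw's permutation to minimize its loss.
        new_perms = [
            min(
                permutations(range(H)),
                key=lambda p: sum((mu_draws[s][p[h]] - action[h]) ** 2 for h in range(H)),
            )
            for s in range(S)
        ]
        if new_perms == perms:               # no label changed: stop iterating
            break
        perms = new_perms
    return [[mu_draws[s][perms[s][h]] for h in range(H)] for s in range(S)]


# Made-up draws of (mu_1, mu_2); the last two have switched labels.
mu_draws = [[0.1, -1.0], [-0.1, -0.9], [0.05, -1.1], [-1.05, 0.0], [-0.95, 0.1]]
relabeled = relabel(mu_draws)
```

After relabeling, the first coordinate of every draw refers to the component near 0 and the second to the component near $-1$, so componentwise posterior summaries become meaningful.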

Instead of using an exchangeable specification and then applying postprocessing, one can 
potentially put constraints in the prior in an attempt to make the separate mixture 
components distinguishable. For example, in the univariate density estimation case, we could 
restrict $\mu_1 < \mu_2 < \cdots < \mu_H$ in the prior distribution so that the higher indexed components 
have higher means. However, there are some clear drawbacks to such an approach. Most 
interesting models in applications are multivariate, and in multivariate cases it is not at all 
clear in most cases what type of restriction is appropriate. Even in the case of a univariate 
Gaussian location-scale mixture, it may be that we need multiple components with similar 
means but different variances to provide a good fit to the data. If the means are close 
together, then label switching can occur even if we place a strict order restriction on the 
means and hence the restriction does not fully solve the label ambiguity problem. In addi- 
tion, placing a restriction on the parameters can lead to mixing problems and bias in which 
the prior favors pushing apart of parameters, with this leading to substantial overestimation 
of the differences in the components in the presence of sizeable posterior uncertainty. 


22.4 Unspecified number of mixture components 


To allow for a variable number of mixture components H, one can potentially also choose 
a prior for H such as a truncated Poisson, with reversible jump MCMC (see Section 12.3) 
then used for posterior computation. Such an approach is computationally intensive, and 
in practice it is much more common to fit the model for varying choices of H and then 
use a goodness-of-fit criterion penalized for model complexity (see Chapter 7) to choose H. 
One computational approach is to use the EM algorithm to obtain a maximum likelihood 
estimate or posterior mode for the parameters for each of a variety of choices of H and then 
report the results for the H having the best marginal posterior probability or estimated 
predictive performance. As discussed in Section 7.4, the marginal posterior probability can 
be sensitive to the prior distribution, and DIC is not well justified theoretically in mixture 
models. WAIC is theoretically justified, but conducting inferences based on the selected H 
will ignore uncertainty in estimating H in the final inferences. 

Fortunately, there is a simple alternative in which posterior simulations are used to 
estimate the posterior distribution of the number of mixture components, while potentially 
averaging over this posterior in estimating the density and other quantities of interest (with 
the exception of mixture component-specific parameters). In fact, one can directly use the 
Gibbs sampler described in the previous section, but with a carefully chosen value for the 
hyperparameter a in the Dirichlet prior (22.10) for the weights on the different mixture 
components. The number of clusters H can be viewed as an upper bound on the number 
of mixture components, as some of these components may be unoccupied. For example, if 
we set H = 20 it can still be the case that all n items in the study are allocated to a small 
number of the 20 components available, for example 3 or 4. Letting $H_n = \sum_{h=1}^H 1_{n_h > 0}$ 
denote the number of occupied components, the Gibbs sampler outlined in the previous 
section produces samples from the posterior of $H_n$ after convergence. 
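
For instance, given saved draws of the allocation vector, the posterior of the number of occupied components can be tabulated directly. The sketch below uses hypothetical function names and a handful of made-up allocation draws.

```python
from collections import Counter


def occupied_components(z):
    """Number of components with at least one item allocated to them."""
    return len(set(z))


def posterior_of_occupied(z_draws):
    """Estimated posterior distribution of the occupied-component count."""
    counts = Counter(occupied_components(z) for z in z_draws)
    S = len(z_draws)
    return {k: c / S for k, c in sorted(counts.items())}


# Made-up allocation draws for n = 6 items with upper bound H = 20.
z_draws = [
    [0, 0, 3, 3, 3, 7],
    [0, 0, 3, 3, 3, 3],
    [0, 0, 3, 3, 3, 3],
    [5, 5, 3, 3, 3, 7],
]
post = posterior_of_occupied(z_draws)
```

Here two draws occupy two components and two draws occupy three, so `post` puts probability 0.5 on each value.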

Ideally, the posterior on $H_n$ would not be sensitive to the choice of upper bound $H$. 
However, if $a$ is set to 1 so that $(\pi_1,\ldots,\pi_H) \sim \mathrm{Dirichlet}(1,\ldots,1)$, the tendency will be to 
allocate items to many different components and as $H$ becomes large there will be a high 
probability of putting each item in its own cluster so that $H_n = n$. A simple alternative is to 
instead let $a = n_0/H$, so that the prior sample size $n_0 = \sum_{h=1}^H a$ is fixed regardless of the size 
of the upper bound $H$, with $n_0 = 1$ providing a reasonable minimally informative default. 
This turns out to work well in favoring allocation of most of the weight to a few dominant 
locations. The Dirichlet prior distribution can be obtained by letting $\pi_h = \lambda_h / \sum_{l=1}^H \lambda_l$ 
with $\lambda_h \sim \mathrm{Gamma}(n_0/H, 1)$. If we sample the $\lambda_h$’s with $n_0 = 1$ and $H$ large, we get many 
values close to zero with a small number in the right tail. After normalization, these values 
in the tail correspond to a few high probability components with the rest getting probability 
close to zero. 
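
The gamma construction is easy to try out. The sketch below draws one weight vector with $a = n_0/H$ and one with $a = 1$ for comparison; the function name, the seed, and the choice $H = 50$ are assumptions for illustration.

```python
import random


def dirichlet_weights(H, a, rng):
    """Draw (pi_1, ..., pi_H) ~ Dirichlet(a, ..., a) by normalizing Gamma(a, 1) draws."""
    lam = [rng.gammavariate(a, 1.0) for _ in range(H)]
    total = sum(lam)
    return [l / total for l in lam]


rng = random.Random(0)
H, n0 = 50, 1.0
sparse = dirichlet_weights(H, n0 / H, rng)   # a = n0/H: prior sample size fixed at n0
flat = dirichlet_weights(H, 1.0, rng)        # a = 1: prior sample size grows with H

# Sorting makes it easy to see how the mass is spread over the components.
sparse_sorted = sorted(sparse, reverse=True)
flat_sorted = sorted(flat, reverse=True)
```

Inspecting `sparse_sorted` and `flat_sorted` over repeated draws typically shows the first concentrating nearly all mass on a few components, while the second spreads it far more evenly.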

To make inferences on the mixture component-specific parameters $(\pi_h, \theta_h,\ h = 1,\ldots,H)$, 
it is necessary to condition on an estimate of the number of clusters $H_n$. Such an estimate 
can be obtained by running an initial MCMC analysis with $H$ chosen as an upper bound 
on the number of clusters, and then setting $\hat H_n$ equal to the posterior mode of $H_n$. Running 
an additional simulation conditional on $H = \hat H_n$, one can apply postprocessing as discussed 
in the previous section to obtain posterior summaries of the mixture component-specific 
parameters. 

Figure 22.4 Histograms overlain with nonparametric density estimates. Top row shows galaxy data, 
bottom row shows acidity data. The three columns show Gaussian kernel density estimation, 
estimated densities, and estimated clusters. 


Example. Simple mixtures fit to small datasets 

We illustrate the Bayesian mixture model in three simple applications from the statis- 
tics literature. The first example is a set of 82 measurements of speeds of galaxies 
diverging from our own. The second is an example of measurements of acidity from 
155 lakes in north central Wisconsin, and the third is a set of 150 observations from 
three different species of iris, each with four measurements. The galaxy and acidity 
data are one-dimensional (and are also used in Chapter 21), while each observation in 
the iris dataset is a four-dimensional vector of attributes of the plant. 

We implemented the Gibbs sampler for a finite location-scale mixture of Gaussians 
in each case using an upper bound on the number of clusters of H. This does not 
mean there will be H clusters in the data, as the prior specification favors emptying 
of redundant clusters that are not needed. As the first two datasets are univariate, 
it was thought that no more than H = 5 clusters were needed. As the number of 
clusters becomes larger than this, interpretability breaks down and it is additionally 
quite likely that two or more of these clusters are closely overlapping and essentially 
redundant. Although the prior specification favors assigning low probability to such 
unnecessary clusters, in practice a small number of observations can nonetheless be 
assigned to them leading to an overestimate of the number of clusters unless one 
ignores very small clusters. As the iris data were four-dimensional, we increased the 
upper bound H to allow for potentially more clusters (although it turned out in this 
case that no more than 6 clusters were needed for a good fit). 

Mixture component, h      1        2        3        4        5 
Weight, $\pi_h$          0.66     0.16     0.06     0.09     0.03 
                        (0.15)   (0.12)   (0.04)   (0.03)   (0.05) 
Location, $\mu_h$        0.10     0.20     1.89    -2.35     0.02 
                        (0.07)   (0.43)   (0.67)   (0.09)   (0.82) 
Scale, $\sigma_h$        0.22     0.38     0.57     0.21     0.45 
                        (0.05)   (0.33)   (0.46)   (0.11)   (0.49) 

Table 22.2 Posterior mean and standard deviation of weight, location, and scale parameters for the 
five mixture components fit to the galaxy data displayed in the top row of Figure 22.4. 


Mixture component, h      1        2        3        4        5 
Weight, $\pi_h$          0.50     0.29     0.10     0.09     0.02 
                        (0.08)   (0.07)   (0.07)   (0.07)   (0.02) 
Location, $\mu_h$       -0.77     1.1%    -0.31     0.46    -0.13 
                        (0.03)   (0.10)   (0.47)   (0.56)   (0.88) 
Scale, $\sigma_h$        0.12     0.23     0.42     0.39     0.48 
                        (0.03)   (0.07)   (0.28)   (0.27)   (0.44) 

Table 22.3 Posterior mean and standard deviation of weight, location, and scale parameters for the 
five mixture components fit to the acidity data displayed in the bottom row of Figure 22.4. 


To simplify prior elicitation in these applications, we standardized the data by 
subtracting the mean and dividing by the standard deviation. We set the Dirichlet 
hyperparameter to $a = 1/H$. As the data were normalized, we set $\mu_0 = 0$ and $\kappa = 1$ 
to generate cluster means from a prior that introduces values at plausible locations, 
noting that choosing a highly diffuse prior for these cluster locations can lead to aber- 
rant behavior in which the posterior overly favors allocation to too few clusters. For 
the cluster-specific precisions, we chose an inverse-gamma prior having parameters 3 
and 1, respectively. This was thought to be a reasonable weakly informative choice 
considering the normalization of the data and that real data seldom require very con- 
centrated or very diffuse mixture components. For simplicity, we assumed a diagonal 
covariance for each Gaussian component in the multivariate iris example, noting that 
allowing nondiagonal covariance is more flexible but also more heavily parameterized. 
We postprocessed Gibbs draws following the above algorithm using a Kullback-Leibler 
loss. As expected, mixing was poor for these component-specific parameters due to 
the identifiability problem. For the galaxy data, the top row of Figure 22.4 shows 
three non-overlapping clusters with the one close to the origin relatively large com- 
pared to the others. Although this large cluster might be interpreted as two highly 
overlapping clusters, it appears to be well approximated by a single normal density. 
Table 22.2, showing posterior summaries of the model parameters, reveals that the 
posterior concentrates on one component centered around zero. 

For the acidity data, the bottom row of Figure 22.4 suggests that two clusters are 
involved. Since one of them appears to be skewed, we expect that three components 
might be needed to approximate this density well. This illustrates that if the form of 
the component does not match well the form of the clusters, the number of components 
cannot be interpreted as the number of clusters. Indeed, as shown in Table 22.3, the 
posterior provides evidence in favor of two or three components. The fitted density 
is also shown in the bottom row of Figure 22.4 along with pointwise 95% credible 
intervals, which can be easily calculated based on the Gibbs sampling output. Based 
on comparing with a usual frequentist histogram, the Bayesian density estimate based 
on the finite mixture model seems to perform quite well. 

Mixture component, h      1        2        3        4        5        6 
Weight, $\pi_h$          0.30     0.23     0.16     0.14     0.11     0.07 
                        (0.05)   (0.04)   (0.03)   (0.04)   (0.03)   (0.02) 
Location, $\mu_{1h}$    -0.93    -0.47     0.29    -1.37    -0.44     1.27 
                        (0.06)   (0.15)   (0.09)   (0.06)   (0.14)   (0.12) 
Location, $\mu_{2h}$     1.01     2.10    -0.45     0.08    -1.43     0.15 
                        (0.08)   (0.17)   (0.10)   (0.07)   (0.15)   (0.11) 
Location, $\mu_{3h}$    -1.28    -1.27     0.53    -1.32     0.03     1.09 
                        (0.05)   (0.14)   (0.07)   (0.06)   (0.12)   (0.10) 
Location, $\mu_{4h}$    -1.22    -1.19     0.48    -1.31    -0.04     1.14 
                        (0.06)   (0.13)   (0.08)   (0.06)   (0.12)   (0.10) 
Scale, $\sigma_h$        0.07     0.14     0.16     0.07     0.24     0.27 
                        (0.01)   (0.04)   (0.02)   (0.01)   (0.04)   (0.03) 

Table 22.4 Posterior mean and standard deviation of weight, location, and scale parameters for 
the six mixture components fit to the iris data, in which each data point is characterized by four 
continuous predictors. 


As the iris data are four-dimensional, they cannot be presented with a simple density 
estimate, but we can estimate the clusters. Table 22.4 shows posterior summaries of 
the parameter estimates after adjustment for label switching. It is interesting to see 
if the clusters we obtain agree with the three species in the data. One advantage of 
the Bayesian approach is that the posterior simulations automatically represent 
uncertainty in the number of clusters. This can be summarized by ordering the components 
by their probability weights. Since the data contain three true biological clusters, 
we may expect the posterior to concentrate on three components. But the posterior 
means for the masses of later clusters are nontrivial, suggesting that the fitted normal 
distributions do not fully capture the variation within one or more of the species. 


22.5 Mixture models for classification and regression 


Mixture models can be useful in classification and regression if the target and predictor 
values are partially observed. 


Classification 


Consider first the classification problem with the goal of predicting an unordered categorical 
response variable $y \in \{1,\ldots,C\}$ from a vector of real-valued predictors $x = (x_1,\ldots,x_p)$. 
In fully supervised classification problems, both $y_i$ and $x_i$ are observed for every item 
in a training sample of $n$ items, while in semi-supervised classification $x_i$ is observed for 
$i = 1,\ldots,n$ but $y_i$ is observed only for a subset of $n_0 < n$ labeled items. 

Instead of modeling the class probability directly, we can model the marginal likelihood 
of $y_i$ and the conditional likelihood of $x_i$ given $y_i$ to induce a model on the classification 
function using Bayes’ rule: 

$$\Pr(y_i = c \mid x_i) = \frac{\Pr(y_i = c)\, f(x_i \mid y_i = c)}{\sum_{c'=1}^C \Pr(y_i = c')\, f(x_i \mid y_i = c')}.$$ 

Let $\psi_c = \Pr(y_i = c)$ denote the marginal probability of falling into response category $c$, 
and let $f_c(x_i) = f(x_i \mid y_i = c)$ denote the conditional likelihood of the predictors $x_i$ for 
items within response category $c$. If we are sufficiently flexible in modeling $\{f_c\}_{c=1}^C$, we can 
accurately characterize any classification function in this manner, allowing nonlinearities, 
higher order interactions, and so forth. The resulting expression is analogous to the cluster 
allocation formula for a $C$-component finite mixture model. 
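
A minimal sketch of this use of Bayes' rule follows, assuming a scalar predictor and two hypothetical normal class-conditional densities; the function names and numbers are illustrative only.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def class_probabilities(x, psi, densities):
    """Pr(y = c | x) from marginal class probabilities psi and class densities f_c."""
    joint = [p * f(x) for p, f in zip(psi, densities)]
    total = sum(joint)
    return [j / total for j in joint]


# Hypothetical two-class example with a scalar predictor.
psi = [0.5, 0.5]
densities = [
    lambda x: normal_pdf(x, -1.0, 1.0),   # f_1: predictors in class 1
    lambda x: normal_pdf(x, 1.0, 1.0),    # f_2: predictors in class 2
]
probs = class_probabilities(0.0, psi, densities)
```

At $x = 0$ the two classes are equally likely by symmetry; replacing each density with a finite mixture changes only the functions passed in.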

In Bayesian discriminant analysis, we specify a prior for the marginal category 
probabilities $\psi = (\psi_1,\ldots,\psi_C)$ and for the collection of category-specific conditional likelihoods 
$\{f_c\}_{c=1}^C$. A simple conjugate prior for $\psi$ is 

$$\psi \sim \mathrm{Dirichlet}(a\psi_{01}, a\psi_{02}, \ldots, a\psi_{0C}),$$ 

where $\psi_0 = \mathrm{E}(\psi)$ is the prior mean and $a$ is a prior sample size controlling prior uncertainty. 
Updating with the likelihood for the training sample of $n$ items in the fully supervised case, 
we obtain 

$$(\psi \mid y, X) \sim \mathrm{Dirichlet}\left(a\psi_{01} + \sum_{i=1}^n 1_{y_i=1}, \ldots, a\psi_{0C} + \sum_{i=1}^n 1_{y_i=C}\right).$$ 
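
The conjugate update amounts to adding class counts to the prior parameters, as the following sketch with made-up labels shows (the function name is hypothetical):

```python
def dirichlet_posterior(y, C, a, psi0):
    """Parameters of the Dirichlet posterior for psi given observed class labels y."""
    counts = [sum(1 for yi in y if yi == c) for c in range(1, C + 1)]
    return [a * psi0[c] + counts[c] for c in range(C)]


y = [1, 3, 3, 2, 3, 1]   # made-up labels in {1, 2, 3}
alpha = dirichlet_posterior(y, C=3, a=3.0, psi0=[1 / 3, 1 / 3, 1 / 3])
post_mean = [al / sum(alpha) for al in alpha]
```

With prior sample size $a = 3$ and a uniform prior mean, the posterior parameters are (3, 2, 4) and the posterior mean is (3/9, 2/9, 4/9).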


It remains to specify a model for $\{f_c\}_{c=1}^C$. One possibility is to let $f_c(x_i) = \mathrm{N}_p(x_i \mid \mu_c, \Sigma)$ 
so that the predictors within each response class are modeled as normal with a common 
variance but with the mean differing across classes. Such a parametric choice may work 
well in certain applications, but results may be sensitive to the assumption of normality, 
particularly in the semi-supervised case and when differences between classes in $f_c(x_i)$ are 
more subtle than simply a shift in the mean. As an effectively nonparametric alternative, 
we can instead assume a finite mixture of multivariate Gaussians with 

$$f_c(x_i) = \sum_{h=1}^H \pi_{ch} \mathrm{N}_p(x_i \mid \mu^*_{ch}, \Sigma^*_{ch}), \quad (\mu^*_{ch}, \Sigma^*_{ch}) \sim P_0,$$ 

where $\pi_c = (\pi_{c1},\ldots,\pi_{cH})$ are mixture weights specific to class $c$ and $(\mu^*_{ch}, \Sigma^*_{ch})$ are means 
and covariances for the different mixture components specific to class $c$. 

This specification is richly parameterized, so for simplicity we recommend letting 
$\pi_c = \pi = (\pi_1,\ldots,\pi_H)$, which uses a single set of mixture weights across the response categories, 
with a $\mathrm{Dirichlet}(1/H,\ldots,1/H)$ prior distribution on $\pi$. To complete the prior specification, 
we can choose a conditionally conjugate normal-inverse Wishart prior $P_0$ for the mean and 
covariance specific to each component. 

Letting $x_i \sim \mathrm{N}_p(\mu_i, \Sigma_i)$, we have $\mu_i = \mu^*_{y_i z_i}$ and $\Sigma_i = \Sigma^*_{y_i z_i}$, with $y_i$ denoting the response 
category for item $i$, and $z_i \in \{1,\ldots,H\}$ denoting the mixture component index within 
that category. Posterior computation can proceed via a straightforward Gibbs sampling 
algorithm. The first step updates the mixture component index $z_i$ from the multinomial 
full conditional posterior, 

$$\Pr(z_i = h \mid y_i = c, -) = \frac{\pi_h \mathrm{N}_p(x_i \mid \mu^*_{ch}, \Sigma^*_{ch})}{\sum_{l=1}^H \pi_l \mathrm{N}_p(x_i \mid \mu^*_{cl}, \Sigma^*_{cl})}.$$ 

We then independently update $\psi$ from its Dirichlet conditional, $\pi$ from its Dirichlet 
conditional, and each $(\mu^*_{ch}, \Sigma^*_{ch})$ from its multivariate normal-inverse Wishart full conditional. 
This final step can be adapted to update $(\mu^*_{ch}, \Sigma^*_{ch})$ under alternative priors as well. This 
algorithm can easily accommodate the fully supervised or semi-supervised cases. In the 
semi-supervised case, we simply include an additional Gibbs sampling step for imputing the 
missing $y_i$ for the unlabeled items from the multinomial full conditional posterior. We can 
easily accommodate missing predictors in this framework as well through including a step 
for updating the missing elements of $x_i$ from their conditional normal posterior given the 
observed elements and current values for the other unknowns. 

In order to predict $y_{n+1}$ for a future item based on its predictors $x_{n+1}$, we would 
simply run the above Gibbs sampler and record the predictive probability of $y_{n+1} = c$ 
for each choice of $c$ at each saved iteration. The average across the posterior simulations 
yields an estimate of $\Pr(y_{n+1} = c \mid x_{n+1}, y, X)$, for $c = 1,\ldots,C$. These estimates can be 
reported to accommodate uncertainty in prediction or a single Bayes optimal prediction can 
be reported by choosing the value of $c$ that minimizes the expected loss. If one chooses a 
0-1 loss function, which gives a loss of 1 for assigning an incorrect $y_{n+1}$ regardless of the 
type of error, then the optimal predictive value will correspond to the $c$ that leads to the 
highest $\Pr(y_{n+1} = c \mid x_{n+1}, y, X)$. 
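
The prediction step can be sketched as follows, assuming the per-iteration class probabilities have already been recorded. The function names, the 0-based class indexing, and the numbers are hypothetical.

```python
def predictive_probabilities(prob_draws):
    """Average per-iteration class probabilities over saved posterior draws."""
    S, C = len(prob_draws), len(prob_draws[0])
    return [sum(d[c] for d in prob_draws) / S for c in range(C)]


def bayes_prediction(probs, loss=None):
    """Class (0-based index) minimizing expected loss; defaults to 0-1 loss.

    loss[k][c] is the loss from predicting class k when the true class is c.
    """
    C = len(probs)
    if loss is None:
        loss = [[0.0 if k == c else 1.0 for c in range(C)] for k in range(C)]
    expected = [sum(loss[k][c] * probs[c] for c in range(C)) for k in range(C)]
    return min(range(C), key=lambda k: expected[k])


# Made-up predictive probabilities recorded at three saved iterations.
prob_draws = [[0.2, 0.5, 0.3], [0.1, 0.6, 0.3], [0.3, 0.4, 0.3]]
probs = predictive_probabilities(prob_draws)
pred = bayes_prediction(probs)
```

Under 0-1 loss the prediction is the class with the highest averaged probability (index 1 here); passing an asymmetric `loss` matrix can change the choice.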

To generalize the above methods to accommodate categorical predictors, we suppose 
that $x_i$ consists of predictors having a variety of measurement scales and supports. 
Initially putting aside the classification problem, suppose we want to model the joint density 
of $x_i$ flexibly, where $x_i$ comprises a mixture of continuous and categorical predictors. Let 
$f(x_i \mid \theta_i) = \prod_{j=1}^p K_j(x_{ij} \mid \theta_{ij})$, with $K_j$ a kernel appropriate to the support of the $j$th 
predictor. This product kernel model assumes conditional independence given $\theta_i$. To induce 
dependence, we again rely on the mixture idea to assign a joint prior distribution on the 
full complement of parameters $\theta_i$, 

$$\theta_i \sim P = \sum_{h=1}^H \pi_h \delta_{\Theta_h}, \quad \Theta_h = (\theta_{h1},\ldots,\theta_{hp}),$$ 

where $\delta_\Theta$ denotes a degenerate distribution with all its mass concentrated at $\Theta$ so that 
$\theta_i = \Theta_h$ with probability $\pi_h$. We let $\theta_{hj} \sim P_{0j}$ independently, with $P_{0j}$ a prior appropriate for 
the $j$th kernel. As some simple examples, we may consider Gaussian kernels for continuous 
predictors, Bernoulli kernels for binary predictors and Poisson kernels for count predictors, 
with the corresponding $P_{0j}$’s chosen to be conjugate. 

The above product kernel mixture is also a simple latent class model. Let $z_i = h$ denote 
the latent class for item $i$, with $\Pr(z_i = h) = \pi_h$. Individuals in latent class (subpopulation) $h$ 
have distinct parameters within each kernel. Elements of $x_i$ are conditionally independent 
given $z_i$ but are marginally dependent. We can use the product kernel approach in the 
classification setting to jointly model the distribution of $y_i$ and $x_i$. Initially suppose that 
$x_i \in \mathbb{R}^p$ as before. Conditionally on $z_i$, we define independent likelihoods with 

$$\Pr(y_i = c \mid z_i = h) = \psi_{hc}, \quad f(x_i \mid z_i = h) = \mathrm{N}_p(x_i \mid \mu_h, \Sigma_h).$$ 


We have a multinomial likelihood for $y_i$ and independently have a multivariate normal 
likelihood for $x_i$ with class-specific parameters. This approach is simpler than the discriminant 
analysis approach in avoiding the need to explicitly model the conditional of $x_i$ given $y_i = c$. 
Instead we just specify coupled finite mixture models for $y_i$ and for $x_i$. Computation is 
also straightforward using a minor modification of the Gibbs sampling algorithm described 
above in the discriminant analysis case. In averaging over $z_i$, we induce 

$$\Pr(y_i = c \mid x_i) = \frac{\sum_{h=1}^H \pi_h \psi_{hc} \mathrm{N}_p(x_i \mid \mu_h, \Sigma_h)}{\sum_{h=1}^H \pi_h \mathrm{N}_p(x_i \mid \mu_h, \Sigma_h)},$$ 

which can be shown to be extremely flexible. We can also do the same for the elements of 
$x_i$; for example, specifying separate multinomials for categorical predictors and normals for 
continuous predictors. 
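
To see how flexible the induced classification function can be, the sketch below evaluates it for a scalar predictor with three latent classes. All parameter values and the function name are made up for illustration; the latent classes at $-2$ and $2$ both favor the first response category, while the one at 0 favors the second, giving a nonmonotone class probability.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def induced_class_probabilities(x, pi, psi, mu, sd):
    """Pr(y = c | x) obtained by averaging over the latent class z (scalar predictor)."""
    w = [p * normal_pdf(x, m, s) for p, m, s in zip(pi, mu, sd)]
    total = sum(w)
    C = len(psi[0])
    return [sum(w[h] * psi[h][c] for h in range(len(pi))) / total for c in range(C)]


pi = [0.5, 0.3, 0.2]                            # latent class probabilities
mu, sd = [-2.0, 0.0, 2.0], [0.5, 0.5, 0.5]      # normal kernel for x within each class
psi = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]      # Pr(y = c | z = h), rows sum to 1
probs_at = {x: induced_class_probabilities(x, pi, psi, mu, sd) for x in (-2.0, 0.0, 2.0)}
```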


Regression 


Suppose that $y_i \in \mathbb{R}$ and $x_i \in \mathbb{R}^p$ and our goal is to build a predictive model for $y_{n+1}$ 
given $x_{n+1}$ for future items based on a training sample of $n$ items providing data $(y_i, x_i)$, 
for $i = 1,\ldots,n$. One approach is to simply let $w_i = (y_i, x_i)$ and then define a finite mixture 
of multivariate normals for the joint density of $w_i$ as follows: 

$$f(w_i) = \sum_{h=1}^H \pi_h \mathrm{N}_{p+1}(w_i \mid \mu_h, \Sigma_h), \tag{22.13}$$ 

which will automatically induce a flexible model for the conditional density of $y_i$ given $x_i$, 

$$f(y_i \mid x_i) = \sum_{h=1}^H \pi_h(x_i)\, \mathrm{N}(y_i \mid \beta_{0h} + x_i\beta_{1h}, \tau_h^2), \tag{22.14}$$ 


where the predictor-dependent weights are defined as 

$$\pi_h(x_i) = \frac{\pi_h \mathrm{N}_p(x_i \mid \mu_h^x, \Sigma_h^x)}{\sum_{l=1}^H \pi_l \mathrm{N}_p(x_i \mid \mu_l^x, \Sigma_l^x)}, \tag{22.15}$$ 

with $\mu_h^x$ and $\Sigma_h^x$ the subvector of $\mu_h$ and submatrix of $\Sigma_h$ corresponding to the predictors. 

Expression (22.14) characterizes the conditional density of the response given 
predictors as a predictor-dependent finite mixture of linear regression models, with each linear 
regression having a separate intercept, slopes, and residual variance. At a particular 
predictor value $x_i = x$, the conditional density of $y_i$ will follow a univariate finite mixture of 
Gaussians form as described in the earlier sections of this chapter. To allow the density 
to evolve smoothly as the predictor values change, the mean in the normal kernel follows 
a linear regression model and the weights on the different regression models are allowed to 
change depending on the predictors. Not only is the mean allowed to change nonlinearly 
with the different predictors but the variance and shape of the residual density can also 
change flexibly. 

An appealing feature of model (22.14) is that it can be easily fit by implementing a 
Gibbs sampler for the simple joint model in (22.13) following the approach outlined earlier, 
with posterior draws for the conditional density (22.14) induced automatically from draws 
from the joint. 

There are some issues that motivate models for the conditional density that do not 
require modeling of the marginal density of the predictors. First, one or more of the 
predictors may be fixed by design and hence it may be unnatural to consider the predictors as 
random variables even if this is done primarily to induce a model on the conditional density 
of $y_i$ given $x_i = x$. Second, the predictors may not all be continuous, making it more 
complicated to define a finite mixture model for the joint density. Third, in applications involving 
moderate to large numbers of predictors, it can be expensive to conduct posterior 
computation for the marginal $f(x_i)$, with this expense difficult to justify when interest focuses 
entirely on the conditional of $y_i$ given $x_i$. Finally, in some cases, a simple one-component 
model could be sufficient for the conditional model of $y_i$ given $x_i$, but joint modeling of $y$ and 
$x$ may require a complex multicomponent model, which can lead to degraded performance in 
predicting $y_i$ given $x_i$. 


22.6 Bibliographic note 


Application of EM to mixture models is described in Dempster, Laird, and Rubin (1977). 
Bishop (2006) reviews the closely related variational Bayes implementation, and Minka 
(2001) presents expectation propagation for mixture models. 

Gelman and King (1990b) fit a hierarchical mixture model using the Gibbs sampler 
in an analysis of elections, using an informative prior distribution to identify the mixture 
components separately. Other Bayesian applications of mixture models include Box and 
Tiao (1968), Turner and West (1993), and Belin and Rubin (1995b). Richardson and Green 
(1997) and Stephens (2000a) discuss Bayesian analysis of mixtures with unspecified numbers 
of components. 

A comprehensive text emphasizing non-Bayesian approaches to finite mixture models is 
Titterington, Smith, and Makov (1985). West (1992) provides a brief review from a Bayesian 
perspective. Muller and Rosner (1997) present an application of a Bayesian hierarchical 
mixture model.

The schizophrenia example is discussed more completely in Belin and Rubin (1990,
1995a) and Gelman and Rubin (1992b). An expanded model applied to more complex data 
from schizophrenics appears in Rubin and Wu (1997). Rubin and Stern (1994) and Gelman, 
Meng, and Stern (1996) demonstrate the use of posterior predictive checks to determine the 
number of mixture components required for an accurate model fit in a different example in 
psychology. 

Relabeling algorithms for the label switching problem in finite mixture models have 
been proposed by Stephens (2000b) and Jasra, Holmes, and Stephens (2005), relying on 
postprocessing after running an initial MCMC analysis ignoring the label-switching issue. 
If one wishes to improve mixing of the labels, one possibility is to incorporate label-switching 
moves within the MCMC as proposed in Papaspiliopoulos and Roberts (2008). 

Ishwaran and Zarepour (2002) justify the Dirichlet($\alpha/H, \ldots, \alpha/H$) prior on the mixture
weights in an overfitted mixture model having H as an upper bound on the number of 
occupied mixture components. Rousseau and Mengersen (2011) provide asymptotic results 
showing the zeroing out of redundant components in overfitted mixture models. 

Some influential references on Bayesian modeling and computation for finite mixture 
models include Diebolt and Robert (1994) and Richardson and Green (1997). Fraley and 
Raftery (2002) provide an overview of the use of finite mixture models for clustering, discrim- 
inant analysis and density estimation. Dunson (2010a) reviews Bayesian mixture models 
for conditional densities and considers a motivating epidemiology application. Dunson and 
Bhattacharya (2010) propose the joint modeling product kernel approach for classification 
and regression and provide a theoretical justification. 


22.7 Exercises 


1. Posterior summaries: Suppose you have a mixture model with three components. For 
each data point you want to identify which of the three components it comes from. 
It would be best to use the full posterior distribution but you need a point estimate. 
Which of the following would you prefer: the posterior mean, the posterior median, or 
the posterior mode? Assume each of these is done pointwise (that is, you are getting the 
marginal mean, median, or mode of the latent component for each data point, not the 
joint mean, median, or mode for all the data points at once). 


2. Mixtures with unspecified numbers of components: Simulate 500 data points from an
equally weighted mixture of three normal distributions centered at $-2$, 0, and 2, each
with scale parameter 1.


(a) Fit a Bayesian model of a mixture of two normal distributions to these data.
(b) Fit a Bayesian model of a mixture of three normal distributions to these data.
(c) Fit a Bayesian model of a mixture of four normal distributions to these data.
(d) Fit a Bayesian model with unspecified number of mixture components, with the total
number of components being allowed to be anywhere between 1 and 6.


3. Fitting long-tailed data with a normal mixture: Repeat the above problem, but simulating
the data from a mixture of three $t_4$ distributions. Again fit the data using mixtures
of normals.


4. Specify a finite mixture of Gaussians model for the density of the galaxy data. Assume a 
symmetric Dirichlet prior with parameter a for the weights on the different components, 
and normal inverse-gamma priors for the location and variance of each Gaussian kernel. 
Plot the Bayesian density estimate under squared error loss along with a simple histogram 
of the data, and comment on how the density estimate changes as (a) the parameter a 
decreases to zero, (b) the number of mixture components k increases, or (c) the variance 
of the normal inverse-gamma prior increases.

5. Consider the football point spread data of Section 1.6. Instead of assuming that the
differences between score differential and point spread follow a normal distribution, fit a 
finite mixture of Gaussians to these data using a symmetric Dirichlet prior with hyper- 
parameter 1/k, where k is the number of mixture components. Run a Gibbs sampler to 
analyze the data, and compare the fitted distribution with that for the normal model. 
Comment on whether the results suggest the Gaussian density provides a good approxi- 
mation. 

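6. Consider the kidney cancer example of Section 2.7. Assume $y_j \sim \mathrm{Poisson}(10 n_j \theta_j)$
with $\theta_j \sim \sum_{h=1}^{k} \pi_h \delta_{\theta_h^*}$, $\theta_h^* \sim \mathrm{Gamma}(\alpha, \beta)$, $\alpha = 20$, $\beta = 430{,}000$, $k = 25$, and $\pi =
(\pi_1, \ldots, \pi_k) \sim \mathrm{Dirichlet}(1/k, \ldots, 1/k)$. Comment on how this model differs from $\theta_j \sim
\mathrm{Gamma}(\alpha, \beta)$. Fit both these models and compare the results.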


7. What problems (if any) result from putting a noninformative prior on component-specific
parameters in a finite mixture model? 

8. Draw 1000 samples from $\pi = (\pi_1, \ldots, \pi_k) \sim \mathrm{Dirichlet}(1/k, \ldots, 1/k)$ for $k = 5, 10, 25, 50,
100, 1000$. For each sample, reorder the elements of $\pi$ to be decreasing from largest to
smallest and estimate posterior summaries of these order statistics. Plot the results and
describe the behavior as $k$ increases. Repeat this exercise for $\pi \sim \mathrm{Dirichlet}(1, \ldots, 1)$.


Dirichlet process models


The Dirichlet process is an infinite-dimensional generalization of the Dirichlet distribution
that can be used to set a prior on unknown distributions. Furthermore, these unknown
distributions can be used to extend finite component mixture models to infinite component
mixture models.


23.1 Bayesian histograms 


We start in this section and the next with the relatively simple setting in which $y_i \overset{\mathrm{iid}}{\sim} f$
and the goal is to obtain a Bayes estimate of the density $f$. The histogram is often (also in
this book) used as a simple form of density estimate. In this section we develop a flexible 
parametric version of the histogram that helps to motivate the fully nonparametric Bayesian 
density estimation of the following section. The remainder of the chapter shows how the 
Dirichlet process can be applied beyond density estimation. 

Assume we have prespecified knots $\xi = (\xi_0, \xi_1, \ldots, \xi_k)$ to define our histogram estimate,
with $\xi_0 < \xi_1 < \cdots < \xi_{k-1} < \xi_k$ and $y_i \in [\xi_0, \xi_k]$. A probability model for the density that
is analogous to the histogram is as follows:

$$f(y) = \sum_{h=1}^{k} 1_{\xi_{h-1} < y \le \xi_h} \frac{\pi_h}{\xi_h - \xi_{h-1}}, \quad y \in \mathbb{R},$$

with $\pi = (\pi_1, \ldots, \pi_k)$ an unknown probability vector. We complete a Bayes specification
with a prior distribution for the probabilities. Assume a Dirichlet($a_1, \ldots, a_k$) prior distribution
for $\pi$,

$$p(\pi \,|\, a) = \frac{\Gamma\left(\sum_{h=1}^{k} a_h\right)}{\prod_{h=1}^{k} \Gamma(a_h)} \prod_{h=1}^{k} \pi_h^{a_h - 1}.$$

The hyperparameter vector can be re-expressed as $a = \alpha \pi_0$, where

$$\mathrm{E}(\pi \,|\, a) = \pi_0 = \left( \frac{a_1}{\sum_h a_h}, \ldots, \frac{a_k}{\sum_h a_h} \right)$$

is the prior mean and $\alpha$ is a scale that is often interpreted as a prior sample size.
The posterior distribution of $\pi$ is then calculated as

$$p(\pi \,|\, y) \propto \prod_{h=1}^{k} \pi_h^{a_h - 1} \prod_{i: y_i \in (\xi_{h-1}, \xi_h]} \frac{\pi_h}{\xi_h - \xi_{h-1}}
\propto \prod_{h=1}^{k} \pi_h^{a_h + n_h - 1} = \mathrm{Dirichlet}(a_1 + n_1, \ldots, a_k + n_k),$$

where $n_h = \sum_i 1_{\xi_{h-1} < y_i \le \xi_h}$ is the number of observations falling in the $h$th histogram bin.
To illustrate the Bayes histogram method, we simulated data from the mixture, 


f(y) = 0.75 Beta(y|1, 5) + 0.25 Beta(y|20, 2), 


with $n = 100$ samples drawn from this density. Assuming the data lie in $[0, 1]$ and choosing
10 equally spaced knots, we applied the Bayes histogram approach and plotted the true
density and simulations from the posterior distribution of the histogram obtained from this
procedure.
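The conjugate update behind this illustration can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code used for the book's example: the function names are hypothetical, the bins are taken to be 10 equal-width intervals on $[0, 1]$ (11 knots), and the prior is assumed to be Dirichlet with $a_h = 0.1$ for every bin (that is, $\alpha = 1$ and a uniform $\pi_0$).

```python
import random

def bayes_histogram_posterior(y, knots, a):
    """Return the Dirichlet posterior parameters a_h + n_h for bins (knots[h], knots[h+1]]."""
    k = len(knots) - 1
    counts = [0] * k
    for yi in y:
        for h in range(k):
            # The first bin is also closed on the left so that y = knots[0] is counted.
            if (knots[h] < yi <= knots[h + 1]) or (h == 0 and yi == knots[0]):
                counts[h] += 1
                break
    return [a[h] + counts[h] for h in range(k)]

def draw_dirichlet(params, rng):
    """Draw from a Dirichlet distribution by normalizing independent gamma variates."""
    g = [rng.gammavariate(p, 1.0) for p in params]
    total = sum(g)
    return [x / total for x in g]

def simulate_mixture(n, rng):
    """Simulate n points from 0.75 Beta(1, 5) + 0.25 Beta(20, 2)."""
    return [rng.betavariate(1, 5) if rng.random() < 0.75 else rng.betavariate(20, 2)
            for _ in range(n)]

rng = random.Random(1)
y = simulate_mixture(100, rng)
knots = [h / 10 for h in range(11)]   # assumption: 10 equal-width bins on [0, 1]
a = [0.1] * 10                        # assumption: alpha = 1 with uniform pi_0
post = bayes_histogram_posterior(y, knots, a)

# One posterior simulation of the histogram: bin height is pi_h / (xi_h - xi_{h-1}).
pi_draw = draw_dirichlet(post, rng)
density_draw = [pi_draw[h] / (knots[h + 1] - knots[h]) for h in range(10)]
```

Repeating the last two lines many times gives posterior simulations of the histogram heights, from which pointwise interval estimates can be read off.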

The Bayes histogram estimator does an adequate job approximating the true density, 
but the results are sensitive to the number and locations of knots. However, an appealing 
property of the Bayes histogram approach is that implementation is easy since we have con- 
jugacy and the posterior can be calculated analytically. In addition, the approach allows 
prior information to be included and allows easy production of interval estimates, and hence 
has some practical advantages over classical histogram estimators. To improve performance 
a prior can be placed on the numbers and locations of knots, with reversible jump MCMC 
(see Section 12.3) used for computation, but such an approach is computationally demand- 
ing. In addition, even averaging over random knots will tend to introduce artifactual bumps 
in the density estimate. The Dirichlet prior distribution is perhaps not the best choice due 
to the lack of smoothing across adjacent bins, but it does have the advantage of conjugacy 
and simplicity in interpretation of the hyperparameters. 


23.2 Dirichlet process prior distributions 
Definition and basic properties 


Motivated by the simplicity of the Bayesian histogram approach with a Dirichlet prior, one 
wonders whether we can somehow bypass the need to explicitly specify bins. This would 
also facilitate extensions to multivariate cases in which there is an explosion of the number 
of bins that would be needed. With this thought in mind, suppose the sample space is
$\Omega$, partitioned into measurable subsets $B_1, \ldots, B_k$. If $\Omega = \mathbb{R}$, then $B_1, \ldots, B_k$ are simply
non-overlapping intervals partitioning the real line into a finite number of bins.

Let $P$ denote the unknown probability measure over $(\Omega, \mathcal{B})$, with $\mathcal{B}$ the collection of all
possible subsets of the sample space $\Omega$. The probability measure will assign probabilities to
these subsets (bins), with the probabilities allocated to a set of bins $B_1, \ldots, B_k$ partitioning
$\Omega$ being

$$\bigl(P(B_1), \ldots, P(B_k)\bigr) = \left( \int_{B_1} f(y)\,dy, \ldots, \int_{B_k} f(y)\,dy \right).$$


If P is a random probability measure (RPM), then these bin probabilities are random 
variables. A simple conjugate prior for the bin probabilities corresponds to the Dirichlet 
distribution. For example, we could let 


$$P(B_1), \ldots, P(B_k) \sim \mathrm{Dirichlet}(\alpha P_0(B_1), \ldots, \alpha P_0(B_k)), \qquad (23.1)$$


where $P_0$ is a base probability measure providing an initial guess at $P$, and $\alpha$ is a prior
concentration parameter controlling the degree of shrinkage of $P$ toward $P_0$.

Prior (23.1) is essentially a Bayesian histogram model closely related to that described
in the previous section. However, the difference is that (23.1) only specifies that bin $B_h$ is
assigned probability $P(B_h)$ and does not specify how probability mass is distributed across
the bin $B_h$. Hence, for a fixed partition $B_1, \ldots, B_k$, (23.1) does not induce a fully specified
prior for the random probability measure $P$. The idea is to eliminate sensitivity to the
choice of partition $B_1, \ldots, B_k$ and induce a fully specified prior on $P$ by assuming that
(23.1) holds for all possible partitions $B_1, \ldots, B_k$ and all $k$.


This electronic edition is for non-commercial purposes only. 




For this specification to be coherent, there must exist a random probability measure
$P$ such that the probabilities assigned to any measurable partition $B_1, \ldots, B_k$ by $P$ are
Dirichlet($\alpha P_0(B_1), \ldots, \alpha P_0(B_k)$). The existence of such a $P$ can be shown by verifying
certain consistency conditions, and the resulting random probability measure $P$ is referred
to as a Dirichlet process. Then, as a concise notation to indicate that a probability measure
$P$ on $(\Omega, \mathcal{B})$ is assigned a Dirichlet process (DP) prior, let $P \sim \mathrm{DP}(\alpha P_0)$, where $\alpha > 0$ is
a scalar precision parameter and $P_0$ is a baseline probability measure also on $(\Omega, \mathcal{B})$. This
baseline $P_0$ is commonly chosen to correspond to a parametric model such as a Gaussian.

The definition of the Dirichlet process and properties of the Dirichlet distribution imply, 


$$P(B) \sim \mathrm{Beta}\bigl(\alpha P_0(B), \alpha(1 - P_0(B))\bigr), \quad \text{for all } B \in \mathcal{B},$$


so that the marginal random probability assigned to any subset $B$ of the support is simply
beta distributed. It follows directly that the prior mean has the form

$$\mathrm{E}(P(B)) = P_0(B), \quad \text{for all } B \in \mathcal{B},$$

so that the prior for $P$ is centered on $P_0$. In addition, the prior variance is

$$\mathrm{var}(P(B)) = \frac{P_0(B)(1 - P_0(B))}{1 + \alpha}, \quad \text{for all } B \in \mathcal{B},$$

so that $\alpha$ is a precision parameter controlling the variance.
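For readers who want the intermediate step, the prior mean and variance follow from the standard moments of a Beta($a, b$) distribution. The symbols $a$ and $b$ below are introduced only for this derivation, with $a = \alpha P_0(B)$ and $b = \alpha(1 - P_0(B))$, so that $a + b = \alpha$:

```latex
\begin{align*}
\mathrm{E}(P(B)) &= \frac{a}{a+b} = \frac{\alpha P_0(B)}{\alpha} = P_0(B), \\
\mathrm{var}(P(B)) &= \frac{ab}{(a+b)^2 (a+b+1)}
  = \frac{\alpha^2 P_0(B)\,(1 - P_0(B))}{\alpha^2 (\alpha + 1)}
  = \frac{P_0(B)\,(1 - P_0(B))}{1 + \alpha}.
\end{align*}
```

The variance shrinks toward zero as $\alpha$ grows, which is the sense in which $\alpha$ acts as a precision.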

Hence, the Dirichlet process is appealing in having a simple specification arising from 
a model similar to a random histogram but without the dependence on the bins, while 
also having simple and intuitive forms for the prior mean and variance. The prior can
be centered on a parametric model for the distribution of the data through the choice of
$P_0$, while allowing $\alpha$ to control uncertainty in this choice. Moreover, it can be shown that
the support of the DP contains all probability measures whose support is contained in the
support of the baseline probability measure $P_0$.

The DP prior distribution also has a conjugacy property that makes inferences
straightforward. To demonstrate this, first let $y_i \overset{\mathrm{iid}}{\sim} P$, for $i = 1, \ldots, n$, and $P \sim \mathrm{DP}(\alpha P_0)$, where we
follow common convention in using $P$ to denote both the probability measure and its
corresponding distribution. Then, from (23.1) and conjugacy properties of the finite Dirichlet
distribution, for any measurable partition $B_1, \ldots, B_k$, we have

$$P(B_1), \ldots, P(B_k) \,|\, y_1, \ldots, y_n \sim \mathrm{Dirichlet}\left( \alpha P_0(B_1) + \sum_{i=1}^{n} 1_{y_i \in B_1}, \ldots, \alpha P_0(B_k) + \sum_{i=1}^{n} 1_{y_i \in B_k} \right).$$


From this and the above development, it is straightforward to obtain 


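$$P \,|\, y_1, \ldots, y_n \sim \mathrm{DP}\left( \alpha P_0 + \sum_{i=1}^{n} \delta_{y_i} \right).$$


The updated precision parameter is $\alpha + n$, so that $\alpha$ is in some sense a prior sample size.
The posterior expectation of $P$ is

$$\mathrm{E}(P(B) \,|\, y^n) = \left( \frac{\alpha}{\alpha + n} \right) P_0(B) + \left( \frac{n}{\alpha + n} \right) \sum_{i=1}^{n} \frac{1}{n} \delta_{y_i}(B). \qquad (23.2)$$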


Hence, the Bayes estimator of P under squared error loss is the empirical measure with 
equal masses at the data points shrunk toward the base measure. It is clear that as the 
sample size increases, the Bayesian estimate of the distribution function under the Dirichlet 
process prior will converge to the empirical distribution function. 
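As a small illustration of (23.2), the sketch below evaluates the posterior mean of the distribution function at a point $t$, taking $B = (-\infty, t]$. The function names are hypothetical, and the standard normal base measure is an assumption made only for the example.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma^2), used here as an assumed base measure P0."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def dp_posterior_mean_cdf(t, y, alpha, base_cdf=normal_cdf):
    """E(P((-inf, t]) | y): base CDF and empirical CDF weighted as in (23.2)."""
    n = len(y)
    empirical = sum(1 for yi in y if yi <= t) / n
    w = alpha / (alpha + n)           # weight on the base measure
    return w * base_cdf(t) + (1.0 - w) * empirical

y = [-1.2, -0.3, 0.1, 0.4, 2.5]
estimate_at_zero = dp_posterior_mean_cdf(0.0, y, alpha=1.0)
```

With $\alpha$ fixed, the weight $\alpha/(\alpha + n)$ on the base measure goes to zero as $n$ grows, so the estimate approaches the empirical distribution function.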




In addition, in the limit as the precision parameter $\alpha$ approaches 0, so that we in some
sense have a noninformative prior distribution, the posterior distribution is

$$P \,|\, y^n \sim \mathrm{DP}\left( \sum_{i=1}^{n} \delta_{y_i} \right).$$


This limiting posterior distribution is sometimes known as the Bayesian bootstrap. Samples 
from the Bayesian bootstrap correspond to discrete distributions supported at the observed 
data points with Dirichlet distributed weights. Compared with the classical bootstrap, the 
Bayesian bootstrap leads to smoothing of the weights. 
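A minimal sketch of the Bayesian bootstrap, assuming distinct data points so that the weights on the observations are Dirichlet($1, \ldots, 1$). The function name is hypothetical, and the mean is used only as an example of a functional of $P$.

```python
import random

def bayesian_bootstrap_means(y, draws, rng):
    """Posterior draws of the mean of P under the Bayesian bootstrap."""
    n = len(y)
    out = []
    for _ in range(draws):
        # Dirichlet(1, ..., 1) weights via normalized Exponential(1) variates.
        g = [rng.expovariate(1.0) for _ in range(n)]
        total = sum(g)
        out.append(sum(gi / total * yi for gi, yi in zip(g, y)))
    return out

rng = random.Random(2)
means = bayesian_bootstrap_means([1.0, 2.0, 4.0, 8.0], 1000, rng)
```

The classical bootstrap would instead give each observation a weight of the form (multinomial count)/$n$, so weights of exactly zero occur; the continuous Dirichlet weights are the smoothing referred to above.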

Even with these many appealing properties, the Dirichlet process prior distribution has
some important drawbacks. First, there is a lack of smoothness apparent in (23.2). Ideally,
one would not simply take a weighted average of the base measure and the empirical measure
with masses at the observed data points, but would allow smooth deviations from the base
measure. Smoothness would imply dependence between $P(B_1)$ and $P(B_2)$ for adjacent bins
$B_1$ and $B_2$. However, the DP actually induces negative correlation between $P(B_1)$ and
$P(B_2)$ for any two disjoint sets $B_1$ and $B_2$, with no account for the distance between these
sets. An even more important concern for density estimation is that realizations from the
DP are discrete distributions. Hence, $P \sim \mathrm{DP}(\alpha P_0)$ implies that $P$ will be atomic, having
nonzero weights only on a set of atoms, and will not have a continuous density on the real
line.

Despite these drawbacks the DP has been useful in developing flexible models for a 
wide variety of problems. Before demonstrating some of the applications we introduce an 
alternative characterization of the Dirichlet process. 


Stick-breaking construction 


The above specification of the Dirichlet process does not provide an intuition for what
realizations $P \sim \mathrm{DP}(\alpha P_0)$ actually look like, since the DP prior was defined indirectly
through the marginal probabilities allocated to finite partitions. However, there is a direct
constructive representation of the Dirichlet process, referred to as the stick-breaking
representation, which is useful in obtaining further insight into properties of the DP and as
a stepping stone for generalizations.

The stick-breaking representation allows us to induce $P \sim \mathrm{DP}(\alpha P_0)$ by letting

$$P(\cdot) = \sum_{h=1}^{\infty} \pi_h \delta_{\theta_h}(\cdot), \quad \pi_h = V_h \prod_{l<h} (1 - V_l), \quad V_h \sim \mathrm{Beta}(1, \alpha), \quad \theta_h \sim P_0, \qquad (23.3)$$

where $\delta_\theta$ denotes a degenerate distribution with all its mass at $\theta$, the atoms $(\theta_h)_{h=1}^{\infty}$ are
generated independently from the base distribution $P_0$, $\pi_h$ is the probability mass at atom
$\theta_h$, and these probability masses are generated from a so-called stick-breaking process that
guarantees that the weights sum to 1.

To describe the stick-breaking process, we start with a stick of unit length representing
the total probability to be allocated to all the atoms. We initially break off a random piece
of length $V_1$, with the length generated from a Beta($1, \alpha$) distribution, and allocate this
$\pi_1 = V_1$ probability weight to the randomly generated first atom $\theta_1 \sim P_0$. There is then
$1 - V_1$ of the stick remaining to be allocated to the other atoms. We break off a proportion
$V_2 \sim \mathrm{Beta}(1, \alpha)$ of the $1 - V_1$ stick and allocate the probability $\pi_2 = V_2(1 - V_1)$ to the
second atom $\theta_2 \sim P_0$. As we proceed, the stick gets shorter and shorter so that the lengths
allocated to the higher indexed atoms decrease stochastically, with the rate of decrease
depending on the DP precision parameter $\alpha$. Because $\mathrm{E}(V_h) = \frac{1}{1+\alpha}$, values of $\alpha$ close to
zero lead to high weight on the first couple of atoms, with the remaining atoms being assigned
small probabilities.

Figure 23.1 Samples from the stick-breaking representation of the Dirichlet process with different
settings of the precision parameter $\alpha$.

Figure 23.1 shows realizations of the stick-breaking process for $P_0$ corresponding to a
standard normal distribution and different values of $\alpha$. It is apparent from this figure that
the DP is not appropriate as a direct prior on the distribution of the data, particularly if
the data are continuous. For continuous data, each new subject requires a new atom so that
a large value of $\alpha$ is required, implying weight close to zero on each atom and hence a small
probability of ties in the realizations from $P$. In the limit as $\alpha \to \infty$, one obtains $y_i \sim P_0$,
and hence for large $\alpha$ and no ties in the observations, the DP prior effectively models the
data as drawn from the parametric base distribution.
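Realizations like those described for Figure 23.1 can be generated directly from (23.3). The following is a minimal sketch with a hypothetical function name; it truncates the infinite sum at a fixed number of atoms (an approximation introduced here for illustration) and assumes a standard normal base distribution.

```python
import random

def stick_breaking(alpha, n_atoms, rng):
    """Truncated draw from DP(alpha * P0) with P0 = N(0, 1).

    Returns (weights, atoms, remaining), where remaining is the stick length
    not yet allocated after n_atoms breaks.
    """
    weights, atoms = [], []
    remaining = 1.0
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, alpha)    # V_h ~ Beta(1, alpha)
        weights.append(v * remaining)      # pi_h = V_h * prod_{l<h} (1 - V_l)
        atoms.append(rng.gauss(0.0, 1.0))  # theta_h ~ P0
        remaining *= 1.0 - v
    return weights, atoms, remaining

rng = random.Random(3)
for alpha in (0.5, 1.0, 10.0):
    weights, atoms, remaining = stick_breaking(alpha, 20, rng)
    largest = max(weights)   # tends to be larger when alpha is small
```

Plotting the weights against the atoms as vertical spikes gives a picture of one realization of $P$; smaller values of $\alpha$ tend to concentrate the mass on the first few atoms.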


23.3 Dirichlet process mixtures 
Specification and Polya urns 


The failure of the DP prior distribution as a direct model for the distribution of the data does 
not imply that it is not useful in applications. Instead, the DP is more appropriately used 
as a prior for an unknown mixture distribution. Focusing again on the density estimation 
case for simplicity, a general kernel mixture model for the density can be specified as 


$$f(y \,|\, P) = \int K(y \,|\, \theta)\, dP(\theta), \qquad (23.4)$$


where $K(\cdot \,|\, \theta)$ is a kernel, with $\theta$ including location and possibly scale parameters, and $P$ is
a mixing measure. In the special case in which $P$ is treated as discrete with masses at a
finite number of $k$ atoms, one obtains a finite mixture model as discussed in Chapter 22.
In an infinite kernel mixture model, one chooses a prior $P \sim \pi_{\mathcal{P}}$ for the unknown mixing
measure $P$, where $\mathcal{P}$ denotes the space of all probability measures on $(\Omega, \mathcal{B})$ and $\pi_{\mathcal{P}}$ denotes
a prior over this space. The prior for the mixing measure induces a prior on the density
$f(y)$ through the integral mapping in (23.4). If $\pi_{\mathcal{P}}$ is chosen to correspond to a DP prior,
then one obtains a DP mixture model. From (23.3) and (23.4), a DP prior on $P$ leads to

$$f(y) = \sum_{h=1}^{\infty} \pi_h K(y \,|\, \theta_h^*), \qquad (23.5)$$

where $\pi \sim \mathrm{stick}(\alpha)$ is shorthand notation to denote that the probability weights are
sampled from a DP stick-breaking process with parameter $\alpha$, and with $\theta_h^* \sim P_0$ independently
for $h = 1, \ldots, \infty$.

Expression (23.5) resembles the finite mixture models considered in Chapter 22, but with 
the important difference that the number of mixture components (latent subpopulations) is 
set to infinity. However, this does not imply that infinitely many components are occupied 
by the subjects in the sample; rather the model allows flexibility by introducing new mixture 
components as subjects are added. Consider the hierarchical specification in which 


$$y_i \sim K(\theta_i), \quad \theta_i \sim P, \quad P \sim \mathrm{DP}(\alpha P_0).$$


This formulation is equivalent to sampling $y_i$ from the infinite mixture model in (23.5). A
key question is how to conduct posterior computation under this DP mixture (DPM). This
initially seems problematic in that the mixing measure $P$ is characterized by infinitely many
parameters, as is apparent in (23.3), and we no longer have joint conjugacy in which the
posterior of $P$ given $y^n = (y_1, \ldots, y_n)$ has a simple form.

A clever way around this problem is to marginalize out $P$ to obtain an induced prior
distribution on the subject-specific parameters $\theta^n = (\theta_1, \ldots, \theta_n)$. In particular, marginalizing
out $P$, we obtain the Polya urn predictive rule,

$$p(\theta_i \,|\, \theta_1, \ldots, \theta_{i-1}) \sim \left( \frac{\alpha}{\alpha + i - 1} \right) P_0(\theta_i) + \sum_{j=1}^{i-1} \left( \frac{1}{\alpha + i - 1} \right) \delta_{\theta_j}. \qquad (23.6)$$

This conditional prior distribution consists of a mixture of the base measure $P_0$ and
probability masses at the previous subjects' parameter values.

A Chinese restaurant process metaphor is commonly used in describing the Polya urn
scheme. Consider a restaurant with infinitely many tables. The first customer sits at a
table with dish $\theta_1^*$. The second customer sits at the first table with probability $\frac{1}{1+\alpha}$ or a
new table with probability $\frac{\alpha}{1+\alpha}$. This process continues with the $i$th customer sitting at
an occupied table with probability $\frac{c_j}{i-1+\alpha}$, where $c_j$ is the number of previous customers
at that table, and sitting at a new table with probability $\frac{\alpha}{i-1+\alpha}$. Each occupied table in
the (infinite) restaurant represents a different cluster of subjects, with new clusters added
at a rate proportional to $\alpha \log n$ in the asymptotic limit. The number of clusters depends
(probabilistically) on the number of subjects $n$, with new clusters introduced as needed as
additional subjects are added to the sample. This makes more sense in typical applications
than finite mixture models in which $k$ does not depend on $n$, and it can be thought of as
a formal procedure mimicking the good practice, when fitting a finite mixture model, of
manually adding new mixture components as necessary to fit the data.
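The seating rule can be simulated directly, which is a convenient way to see how the number of clusters grows with $n$. This is a minimal sketch with hypothetical function names; it tracks only the table sizes, not the dishes $\theta^*$.

```python
import random

def chinese_restaurant_process(n, alpha, rng):
    """Seat n customers sequentially and return the list of table sizes."""
    tables = []
    for i in range(1, n + 1):               # i-th customer; i - 1 already seated
        u = rng.random() * (i - 1 + alpha)
        cum = 0.0
        for j, c in enumerate(tables):
            cum += c                        # table j chosen with prob c / (i - 1 + alpha)
            if u < cum:
                tables[j] += 1
                break
        else:
            tables.append(1)                # new table with prob alpha / (i - 1 + alpha)
    return tables

def expected_tables(n, alpha):
    """Exact prior expectation of the number of occupied tables after n customers."""
    return sum(alpha / (alpha + i - 1) for i in range(1, n + 1))

rng = random.Random(4)
sizes = chinese_restaurant_process(500, 1.0, rng)
n_clusters = len(sizes)
```

The sum in `expected_tables` is the sum of the new-table probabilities over customers, and it grows roughly like $\alpha \log n$ for large $n$, matching the rate quoted above.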

The simple form for the conditional distribution in (23.6) leads to a useful idea for
posterior computation and prediction. From exchangeability of the subjects $i = 1, \ldots, n$,
one can obtain the conditional prior distribution for $\theta_i$ given $\theta^{(-i)} = (\theta_j, j \neq i)$ as

$$\theta_i \,|\, \theta^{(-i)} \sim \left( \frac{\alpha}{\alpha + n - 1} \right) P_0(\theta_i) + \sum_{h=1}^{k^{(-i)}} \left( \frac{n_h^{(-i)}}{\alpha + n - 1} \right) \delta_{\theta_h^*}, \qquad (23.7)$$

where $\theta_h^*$, $h = 1, \ldots, k^{(-i)}$, are the unique values of $\theta^{(-i)}$, and $n_h^{(-i)} = \sum_{j \neq i} 1_{\theta_j = \theta_h^*}$.

Updating the full conditional prior (23.7) with the data, one obtains a conditional
posterior distribution for $\theta_i$ having the same form but with updated weights on the components
and updated $P_0$, as long as $P_0$ is conjugate to the kernel $K$. For example, this occurs when
$K(\cdot \,|\, \theta)$ is a normal kernel, with $\theta = (\mu, \phi)$ the mean and precision and $P_0$ a normal-gamma
prior distribution. Potentially, one can update the $\theta_i$'s one at a time from these full
conditional posterior distributions in implementing Gibbs sampling. However, this approach
tends to have poor mixing.

An alternative marginal Gibbs sampler, which instead separately updates the allocation
of subjects to clusters and the cluster-specific parameters, proceeds as follows. Let $\theta^* =
(\theta_1^*, \ldots, \theta_k^*)$ denote the unique values of $\theta$ and let $S_i = c$ if $\theta_i = \theta_c^*$, so that $S_i$ denotes
allocation of subject $i$ to a cluster. The Gibbs sampler alternates between

1. Update the allocation $S$ by sampling from the multinomial conditional posterior with

$$\Pr(S_i = c \,|\, -) \propto \begin{cases} n_c^{(-i)} K(y_i \,|\, \theta_c^*), & c = 1, \ldots, k^{(-i)}, \\ \alpha \int K(y_i \,|\, \theta)\, dP_0(\theta), & c = k^{(-i)} + 1. \end{cases}$$

If $S_i = k^{(-i)} + 1$, then subject $i$ is allocated to a singleton cluster.
2. Update the unique values $\theta^*$ by sampling from

$$p(\theta_c^* \,|\, -) \propto P_0(\theta_c^*) \prod_{i: S_i = c} K(y_i \,|\, \theta_c^*),$$

which is simply the posterior distribution under the parametric model that assigns prior
distribution $P_0$ to the parameters $\theta_c^*$ and then updates this prior with the likelihood for
those subjects in cluster $c$.


When $P_0$ is conjugate to the kernel $K$, the integral in step 1 can be calculated analytically
and the conditional posterior in step 2 has the same parametric form as $P_0$ except with
updated parameters. For example, when the kernel is Gaussian, with $\theta$ the mean and
variance and $P_0$ a conjugate normal-inverse-gamma prior, the conditional distribution of
$\theta_c^*$ has the same normal-inverse-gamma form described in Chapter 22 in the finite mixture
case. There are modifications available to accommodate nonconjugate cases as well.
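The two steps of the marginal Gibbs sampler can be sketched as follows. To keep the example short, it makes a simplifying assumption that differs from the normal-inverse-gamma setup just described: the kernel is $\mathrm{N}(\theta, \sigma^2)$ with $\sigma^2$ known, and $P_0 = \mathrm{N}(\mu_0, \tau^2)$, so the step 1 integral is the density of $\mathrm{N}(\mu_0, \sigma^2 + \tau^2)$ and step 2 is a normal draw. All function and argument names are hypothetical.

```python
import math
import random

def norm_pdf(y, mu, var):
    """Density of N(mu, var) evaluated at y."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def draw_theta(members, sigma2, mu0, tau2, rng):
    """Draw a cluster mean from its conjugate normal posterior."""
    prec = 1.0 / tau2 + len(members) / sigma2
    mean = (mu0 / tau2 + sum(members) / sigma2) / prec
    return rng.gauss(mean, math.sqrt(1.0 / prec))

def marginal_gibbs(y, alpha, sigma2, mu0, tau2, n_iter, rng):
    """Marginal Gibbs sampler; returns the number of occupied clusters per iteration."""
    n = len(y)
    S = [0] * n                                   # start with everyone in one cluster
    theta = [draw_theta(y, sigma2, mu0, tau2, rng)]
    k_trace = []
    for _ in range(n_iter):
        # Step 1: update each allocation S_i given the other allocations.
        for i in range(n):
            counts = [0] * len(theta)
            for j in range(n):
                if j != i:
                    counts[S[j]] += 1
            # Drop any cluster left empty once subject i is removed, and relabel.
            keep = [c for c in range(len(theta)) if counts[c] > 0]
            relabel = {c: new for new, c in enumerate(keep)}
            theta = [theta[c] for c in keep]
            counts = [counts[c] for c in keep]
            S = [relabel.get(s, -1) for s in S]   # S[i] may be -1 until reassigned
            w = [counts[c] * norm_pdf(y[i], theta[c], sigma2)
                 for c in range(len(theta))]
            w.append(alpha * norm_pdf(y[i], mu0, sigma2 + tau2))  # new cluster
            c_new = rng.choices(range(len(w)), weights=w)[0]
            if c_new == len(theta):
                # Singleton cluster: draw its parameter given y_i alone.
                theta.append(draw_theta([y[i]], sigma2, mu0, tau2, rng))
            S[i] = c_new
        # Step 2: update the unique values given the allocations.
        for c in range(len(theta)):
            members = [y[i] for i in range(n) if S[i] == c]
            theta[c] = draw_theta(members, sigma2, mu0, tau2, rng)
        k_trace.append(len(theta))
    return k_trace
```

A call such as `marginal_gibbs(y, alpha=1.0, sigma2=0.25, mu0=0.0, tau2=9.0, n_iter=200, rng=random.Random(5))` on standardized data returns the sampled number of occupied clusters at each iteration. The allocation step recomputes counts from scratch for clarity, which costs order $n^2$ per sweep; a practical implementation would update the counts incrementally.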

In step 1 of the above Gibbs sampler, either the $i$th subject is allocated to an existing
cluster occupied by one of the other subjects in the sample or the subject is allocated
to a new cluster. The conditional posterior probability of allocation to a new cluster is
proportional to the DP precision parameter $\alpha$ multiplied by the marginal likelihood for the
$i$th subject's data, obtained by integrating the likelihood $K(y_i \,|\, \theta)$ over the prior $\theta \sim P_0$. If
$\alpha$ is close to zero or this marginal likelihood is small relative to the likelihoods for the $i$th
subject's data given allocation to one of the occupied clusters, then subject $i$ will tend to be
allocated to an existing cluster. Hence, both $\alpha$ and $P_0$ play important roles in controlling
the posterior distribution over clusterings. As $\alpha$ decreases, there is an increasing tendency
to cluster subjects, with a parametric model $y_i \sim K(\theta)$ for a common $\theta$ obtained in the
limit as $\alpha \to 0$. In practice, it is common to either set $\alpha = 1$ to favor allocation to few
clusters or to choose a gamma hyperprior for $\alpha$ to allow greater data-adaptivity, with an
additional MCMC step included to update $\alpha$.

Somewhat more subtle, and often overlooked, is the role of $P_0$ in controlling clustering
behavior. One may naively try a high-variance $P_0$ to express ignorance about the prior
distribution of likely locations for the different kernels. However, similarly to the discussion
in Section 7.4, a flat $P_0$ can turn out to make strong assumptions, in this case
effectively placing a heavy penalty on the introduction of new clusters. This is because
as the variance of $P_0$ becomes high, the marginal likelihood will decrease, since the prior
$P_0$ places small probability in a region of plausible $\theta$ values in such cases. In the limit as
the variance of $P_0 \to \infty$, the posterior will behave as if the likelihood is $y_i \sim K(\theta)$ with
a common $\theta$ for all individuals. In practice, we recommend constructing an informative
$P_0$ placing high probability on introducing clusters near the support of the data; this can
be facilitated by standardizing the data in advance of the analysis. Refer to the relevant
discussion in Chapter 22.

This Gibbs sampler for Dirichlet process models closely resembles the Gibbs sampler for
finite mixtures, with the main difference being that we marginalize out the weights $\pi$ on the
different clusters and allow the number of clusters to vary across the samples. The number
of mixture components $k$ represented in the sample of $n$ subjects is treated as unknown, and
we obtain samples from the posterior of $k$ automatically without needing a reversible jump
MCMC algorithm. From the Gibbs samples, we can also estimate the predictive density of
$y_{n+1}$ using

$$f(y_{n+1}) = \sum_{c=1}^{k} \left( \frac{n_c}{\alpha + n} \right) K(y_{n+1} \,|\, \theta_c^*) + \left( \frac{\alpha}{\alpha + n} \right) \int K(y_{n+1} \,|\, \theta)\, dP_0(\theta),$$


averaged over posterior simulations. The simplicity of this Gibbs sampler and the ability 
to bypass the issue of selection of k by embedding in an infinite mixture model, which 
automatically introduces new components at a slow rate as needed when additional subjects 
are added to the sample, are major reasons for the large applied success of Dirichlet process 
mixture models. 

The Gibbs sampler for finite mixture models introduced in Chapter 22 provides an
approximation to a DP mixture model with $P \sim \mathrm{DP}(\alpha P_0)$ as long as the mixture component-specific
parameters are drawn iid a priori from $P_0$ and the prior on the weights is $\pi \sim
\mathrm{Dirichlet}(\alpha/k, \ldots, \alpha/k)$. The approximation improves with $k$, and in practice one can set $k$
equal to a conservative upper bound on the number of occupied clusters in the sample
($k = 25$ or 50 can work well). Indeed, we refer the reader to the discussions in Chapter 22
pertaining to the issues that arise in finite mixture modeling, as essentially the same issues
arise in infinite discrete mixtures, such as DPMs, and the same solutions apply.


Blocked Gibbs sampler 


By marginalizing out the random probability measure $P$, we give up the ability to conduct
inferences on $P$. By having approaches that avoid marginalization, we open the door to
generalizations of DPMs for which marginalization is not possible analytically. One
approach for avoiding marginalization is to rely on the construction in (23.5). Because the
stick-breaking construction orders the mixture components so that the weights are
stochastically decreasing in the index $h$, for a sufficiently high index $N$, we will have that $\sum_{h=N+1}^{\infty} \pi_h$
has a distribution concentrated near zero. Hence, we can obtain an accurate approximation
by letting $V_N = 1$ in the stick-breaking process so that $\pi_h = 0$ for $h = N+1, \ldots, \infty$, with $N$
chosen to be sufficiently large. In practice, $N = 25$ or 50 is commonly chosen as a default,
with $N$ providing an upper bound on the number of clusters in the $n$ subjects in the sample.
We have rarely seen a need for more than 10 or 15 clusters to accurately fit the unknown
density.

The truncation approximation to the DP leads to a straightforward MCMC algorithm 
for posterior computation, and represents an alternative to the finite Dirichlet approxima- 
tion described in Chapter 22. It is not clear which of these approaches leads to more efficient 
posterior computation, though the exchangeability of the components in the finite Dirich- 
let approximation conveys some advantages in terms of mixing. Using the stick-breaking 
truncation, the following blocked Gibbs sampler can be used: 


1. Update $S_i \in \{1, \ldots, N\}$ by multinomial sampling with

$$\Pr(S_i = c \,|\, -) = \frac{\pi_c K(y_i \,|\, \theta_c^*)}{\sum_{c'=1}^{N} \pi_{c'} K(y_i \,|\, \theta_{c'}^*)},$$

where $S_i = c$ if $\theta_i = \theta_c^*$ denotes that subject $i$ is allocated to cluster $c$.
2. Update the stick-breaking weight $V_c$, $c = 1, \ldots, N-1$, from Beta$\left(1 + n_c, \alpha + \sum_{c'=c+1}^{N} n_{c'}\right)$.
3. Update $\theta_c^*$, $c = 1, \ldots, N$, exactly as in the finite mixture model, with the parameters for
unoccupied clusters with $n_c = 0$ sampled from the prior $P_0$.


This algorithm involves simple sampling steps and is straightforward to implement. In 
order to estimate the density $f(y)$ one can follow the approach of monitoring
$f(y) = \sum_{h=1}^{N} \pi_h K(y \mid \theta_h^*)$ at each iteration over a dense grid of $y$ values (for example, an
equally spaced grid of 100 values ranging from the minimum of the observed $y$'s minus a small
buffer to the maximum of the observed y’s plus a small buffer). Based on these samples, 
we can compute posterior inferences. 
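
The grid-based monitoring can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names (`make_grid`, `mixture_density`, `pointwise_summary`, `gaussian_kernel`) are hypothetical, the kernel is taken to be Gaussian purely for the example, and the two saved "iterations" are made-up weight and atom values rather than sampler output.

```python
from statistics import NormalDist

def make_grid(y, n_points=100, buffer=1.0):
    """Equally spaced grid from min(y) - buffer to max(y) + buffer."""
    lo, hi = min(y) - buffer, max(y) + buffer
    step = (hi - lo) / (n_points - 1)
    return [lo + k * step for k in range(n_points)]

def mixture_density(grid, pi, theta, kernel):
    """Evaluate f(y) = sum_h pi_h K(y | theta_h) at every grid point."""
    return [sum(p * kernel(y, th) for p, th in zip(pi, theta)) for y in grid]

def pointwise_summary(curves, level=0.95):
    """Pointwise mean and order-statistic interval across saved curves."""
    m = len(curves)
    lo_idx = int((1 - level) / 2 * (m - 1))
    hi_idx = (m - 1) - lo_idx
    summary = []
    for values in zip(*curves):  # one tuple of values per grid point
        s = sorted(values)
        summary.append((sum(s) / m, s[lo_idx], s[hi_idx]))
    return summary

def gaussian_kernel(y, theta):
    mu, sd = theta
    return NormalDist(mu, sd).pdf(y)

grid = make_grid([0.2, 1.5, 3.1], n_points=5, buffer=0.5)
curves = [
    mixture_density(grid, [0.6, 0.4], [(0.0, 1.0), (3.0, 0.5)], gaussian_kernel),
    mixture_density(grid, [0.5, 0.5], [(0.5, 1.0), (2.5, 0.7)], gaussian_kernel),
]
summary = pointwise_summary(curves)
```

In a real run, one curve would be saved per retained MCMC iteration, and the interval endpoints become meaningful only with many saved curves.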

When running the algorithm, it is good practice to monitor $S_{\max} = \max(S_1, \ldots, S_n)$ to
verify that the maximum occupied component index has a low probability of being close 
to the upper bound of N. Otherwise, the upper bound should be increased. Convergence 
should be monitored on the sampled $f(y)$ values and not on the mixture component-specific
parameters. As discussed in Chapter 22, label ambiguity problems often lead to poor mixing 
of the component-specific parameters, but this may not impact convergence and mixing of 
the induced density of interest. 

Gibbs sampling algorithms that rely on stick-breaking representations have performed 
well in our experience. But in some cases, all of the above algorithms can encounter slow 
mixing that arises due to the multimodal nature of the posterior in which it can be difficult to 
move rapidly between different clusterings. This mixing problem can be partly addressed by 
incorporating label switching moves and there is also a literature on split-merge algorithms 
designed for more rapid exploration of the distribution of cluster allocations. 


Hyperprior distribution 


The DP precision parameter $\alpha$ plays a key role in controlling the prior on the number of
clusters, and there are a number of possible strategies in terms of specifying $\alpha$. One can
fix $\alpha$ at a small value to favor allocation to few clusters relative to the sample size, with
a commonly used default value corresponding to $\alpha = 1$. This implies that, in the prior
distribution, two randomly selected individuals have a 50-50 chance of belonging to the
same cluster. Alternatively, one can allow the data to inform about the appropriate value
of $\alpha$ by choosing a hyperprior, such as $\alpha \sim \text{Gamma}(a_\alpha, b_\alpha)$, and then updating $\alpha$ during
the MCMC analysis. For the blocked Gibbs sampler, the gamma hyperprior is conditionally
conjugate with the resulting conditional posterior being


$$\alpha \mid - \sim \text{Gamma}\left(a_\alpha + N - 1,\; b_\alpha - \sum_{h=1}^{N-1} \log(1 - V_h)\right).$$


Hence, sampling from this conditional can be included as an additional step in the algorithm 
described in the previous subsection. 
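
The extra step can be sketched as below. This is a minimal illustration, assuming the gamma distribution is parameterized by shape and rate; the function names and the small guard `eps` against $\log 0$ are hypothetical additions for the example.

```python
import math
import random

def alpha_conditional_params(V, a_alpha, b_alpha):
    """Shape and rate of the gamma full conditional for alpha.

    V holds the stick-breaking fractions V_1, ..., V_{N-1} (V_N = 1 is excluded).
    """
    eps = 1e-12  # guard against log(0) if some V_h is numerically 1
    shape = a_alpha + len(V)
    rate = b_alpha - sum(math.log(max(1.0 - v, eps)) for v in V)
    return shape, rate

def sample_alpha(V, a_alpha, b_alpha, rng):
    shape, rate = alpha_conditional_params(V, a_alpha, b_alpha)
    # random.gammavariate takes (shape, scale), so pass scale = 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)

shape, rate = alpha_conditional_params([0.5, 0.5], a_alpha=1.0, b_alpha=1.0)
alpha_draw = sample_alpha([0.5, 0.5], 1.0, 1.0, random.Random(0))
```

Because each $\log(1 - V_h)$ is negative, the rate always exceeds $b_\alpha$. Large stick-breaking fractions, meaning most mass sits on the first few components, therefore pull $\alpha$ toward smaller values.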

In our experience, the data tend to be informative about the precision parameter $\alpha$ of
the Dirichlet process, and hence there is substantial Bayesian learning, with a high variance
prior often resulting in a concentrated, low variance posterior. It may seem counterintuitive
that the data can inform strongly about the number of clusters through the hyperparameter
$\alpha$ given that maximum likelihood estimation leads to overfitting, with more clusters resulting
in a higher maximized likelihood. However, the Bayesian approach favors clusterings and
values of $\alpha$ that result in a high marginal likelihood. If individuals are allocated to many
different clusters, the effective number of parameters in the likelihood is increased, and we
then integrate across a larger space in calculating the marginal likelihood. This induces an
intrinsic penalty that favors allocation to few clusters that are really needed to fit the data;
there is no tendency for overfitting.

The more difficult and subtle issue is choice of the base measure $P_0$. Often the base
measure is chosen for computational convenience to be conjugate. However, even in conjugate
parametric families such as normal-gamma, we can potentially improve flexibility by
placing hyperpriors on the parameters in $P_0$. $P_0$ can be thought of as inducing the
prior for the cluster locations. If these locations are too spread out, because $P_0$ has high
variance, then the penalty in the marginal likelihood for allocating individuals to different
clusters will be large, and the tendency will be to overly favor allocation to a single cluster.

It is crucial to consider the measurement scale of the data in choosing $P_0$. The variance of
$P_0$ is only meaningful relative to the scale of the data. A common approach is to standardize
the data $y$ and then choose $P_0$ to be centered at zero with close to unit variance. If we set unit
variance and do not standardize, then how flat $P_0$ is depends on the unit of measurement
in the data; if we change from inches to miles, we may get completely different results.


Example. A toxicology application 

As an illustrative application, we consider data from a developmental toxicology study of
ethylene glycol in mice conducted by the National Toxicology Program. In particular,
$y_i$ is the number of implantations in the $i$th pregnant mouse, with mice assigned to
dose groups of 0, 750, 1500, or 3000 mg/kg/day. Group sizes were 25, 24, 23, and
23, respectively. Scientific interest focuses on studying a dose response trend in the
number of implants, and we initially focus on separately estimating the distribution
within each group. As in many biological applications in which there are constraints
on the range of the counts, the data are underdispersed, in this case with a mean of
12.5 and variance of 6.8 in the control group. Figure 23.2 presents a histogram of
the raw data for the control group (25 mice), along with a series of estimates of the
posterior probabilities $\Pr(y = j)$ assuming $y_i \sim P$ with $P \sim \text{DP}(\alpha P_0)$, $\alpha = 1$ or 5, and
$P_0$ set to $\text{Poisson}(\bar{y})$ for simplicity.

This approach places a Dirichlet process prior directly on the distribution of the count 
data instead of using a Dirichlet process mixture. Counts are discrete so this seems like 
a reasonable initial approach. In addition, when a Dirichlet process is used directly for 
the distribution of the data, one can rely on the conjugacy property to avoid MCMC. 
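In particular, assuming $y_i \overset{\text{iid}}{\sim} P$ for $i = 1, \ldots, 25$ (focusing only on the mice in the
control group to start), and $P \sim \text{DP}(\alpha P_0)$, we have

$$P \mid y_1, \ldots, y_n \sim \text{DP}\left(\alpha P_0 + \sum_{i=1}^{n} \delta_{y_i}\right),$$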


so that the posterior mean probability of y = j is simply 


$$\Pr(y = j \mid y_1, \ldots, y_n) = \left(\frac{\alpha}{\alpha + n}\right) P_0(j) + \left(\frac{1}{\alpha + n}\right) \sum_{i=1}^{n} 1_{y_i = j},$$


where $P_0(j)$ is the probability of $y = j$ under $P_0$ in the prior distribution. This
expression is simply the weighted average of the prior mean and the proportion of
cases where $y = j$ in the observed data, with the weight on the prior being proportional
to $\alpha$ and the weight on the data being proportional to $n$.
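
Because the posterior mean is available in closed form, it can be computed directly. The sketch below is illustrative only: the function names are hypothetical, and the counts in `y` are made-up values rather than the study data. The base measure is $\text{Poisson}(\bar{y})$, as in the example.

```python
import math
from collections import Counter

def poisson_pmf(j, rate):
    return math.exp(j * math.log(rate) - rate - math.lgamma(j + 1))

def dp_posterior_mean_pmf(y, alpha, support):
    """Posterior mean of Pr(y = j) under P ~ DP(alpha * P0), P0 = Poisson(mean(y))."""
    n = len(y)
    ybar = sum(y) / n
    counts = Counter(y)  # counts[j] is 0 for values never observed
    return {
        j: (alpha * poisson_pmf(j, ybar) + counts[j]) / (alpha + n)
        for j in support
    }

y = [9, 11, 12, 12, 13, 13, 14, 15, 16, 10]  # made-up counts, not the study data
pmf = dp_posterior_mean_pmf(y, alpha=1.0, support=range(0, 80))
```

With $n = 10$ and $\alpha = 1$, each observed value receives a spike of height $1/11$ per occurrence on top of a heavily downweighted Poisson curve. This is the lack of smoothing discussed next.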

To illustrate the behavior as the sample size increases, we take a random subsample
of the data of size 10. As Figures 23.2 and 23.3 illustrate, the lack of smoothing in
the nonparametric Bayes estimate under a Dirichlet process prior is unappealing in
not allowing borrowing of information about local deviations from $P_0$. In particular,
for small sample sizes as in Figure 23.3, the posterior mean probability mass function
corresponds to the base measure with high peaks on the observed $y$. As the sample size
increases, the empirical probability mass function increasingly dominates the base.

Figure 23.2 Histogram of the number of implantations per pregnant mouse in the control group
(black line) and posterior mean of $\Pr(y = j)$ assuming a Dirichlet process prior on the distribution
of the number of implants with $\alpha = 1, 5$ (gray and black dotted lines, respectively) and base measure
$P_0 = \text{Poisson}(\bar{y})$.

Figure 23.3 Histogram of a subsample of size 10 from the control group on implantation in mice
(black line) and posterior mean of $\Pr(y = j)$ assuming a DP prior on the distribution of the
number of implants with $\alpha = 1, 5$ (gray and black dotted lines, respectively) and base measure
$P_0 = \text{Poisson}(\bar{y})$.

Potentially, by using a Dirichlet process mixture (DPM) instead of a DP directly, one
may obtain better performance in practice. For count data, it seems natural to use a
Poisson kernel $K(\cdot)$ with a gamma base measure $P_0$, so that

$$y_i \sim \text{Poisson}(\theta_i), \quad \theta_i \sim P, \quad P \sim \text{DP}(\alpha P_0), \quad P_0 = \text{Gamma}(a, b). \qquad (23.8)$$


In this case, we can easily work out the steps involved in the blocked Gibbs sampler. 


1. Update $S_i \in \{1, \ldots, N\}$ by multinomial sampling with

$$\Pr(S_i = c \mid -) = \frac{\pi_c\, \text{Poisson}(y_i \mid \theta_c^*)}{\sum_{c'=1}^{N} \pi_{c'}\, \text{Poisson}(y_i \mid \theta_{c'}^*)}, \quad c = 1, \ldots, N,$$

where $S_i = c$ if $\theta_i = \theta_c^*$ denotes that subject $i$ is allocated to cluster $c$.

2. Update the stick-breaking weight $V_c$, $c = 1, \ldots, N - 1$, from

$$\text{Beta}\left(1 + n_c,\; \alpha + \sum_{c'=c+1}^{N} n_{c'}\right).$$

3. Update $\theta_c^*$, $c = 1, \ldots, N$, from its conditional posterior,

$$\text{Gamma}\left(a + \sum_{i: S_i = c} y_i,\; b + n_c\right),$$

with $n_c = \sum_{i=1}^{n} 1_{S_i = c}$, the size of the $c$th cluster.
A conservative upper bound of N = n can be used for the truncation level. 
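
The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a tuned implementation: the function names, the hyperparameter values $a = 1$, $b = 0.1$, the made-up counts in `y`, and the `tiny` underflow guard are all hypothetical choices for the example. $\text{Gamma}(a, b)$ is treated as shape and rate, and $\alpha$ is held fixed.

```python
import math
import random

def poisson_logpmf(y, rate):
    return y * math.log(rate) - rate - math.lgamma(y + 1)

def weights_from_sticks(V):
    """pi_c = V_c * prod_{l < c} (1 - V_l)."""
    pi, remaining = [], 1.0
    for v in V:
        pi.append(v * remaining)
        remaining *= 1.0 - v
    return pi

def blocked_gibbs_poisson(y, N, alpha, a, b, n_iter, seed=0):
    rng = random.Random(seed)
    tiny = 1e-300  # keeps logs finite if a weight or rate underflows
    theta = [rng.gammavariate(a, 1.0 / b) for _ in range(N)]  # atoms from P0
    pi = [1.0 / N] * N
    draws = []
    for _ in range(n_iter):
        # Step 1: allocate each subject to a cluster.
        S = []
        for yi in y:
            logw = [math.log(max(pi[c], tiny)) + poisson_logpmf(yi, max(theta[c], tiny))
                    for c in range(N)]
            top = max(logw)
            w = [math.exp(l - top) for l in logw]
            S.append(rng.choices(range(N), weights=w)[0])
        # Step 2: stick-breaking fractions, with V_N fixed at 1.
        n_c = [S.count(c) for c in range(N)]
        V = [rng.betavariate(1.0 + n_c[c], alpha + sum(n_c[c + 1:]))
             for c in range(N - 1)]
        V.append(1.0)
        pi = weights_from_sticks(V)
        # Step 3: atoms; an empty cluster gets Gamma(a, b), i.e. a draw from P0.
        for c in range(N):
            total = sum(yi for yi, s in zip(y, S) if s == c)
            theta[c] = rng.gammavariate(a + total, 1.0 / (b + n_c[c]))
        draws.append((list(pi), list(theta), max(S) + 1))
    return draws

y = [9, 11, 12, 12, 13, 13, 14, 15, 16, 10]  # made-up counts, not the study data
draws = blocked_gibbs_poisson(y, N=len(y), alpha=1.0, a=1.0, b=0.1, n_iter=50)
pi_last, theta_last, s_max = draws[-1]
```

The third element saved at each iteration is the largest occupied component index, which can be compared against $N$ to check the truncation level.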
Although a Dirichlet process mixture of Poissons is the obvious choice and leads to 
simple computation, there is a lurking problem with this approach, which is a common 
issue in hierarchical Poisson models in general. In particular, the Poisson kernel is 
inflexible in that it restricts the mean and variance to be equal. In using a mixture 
of Poissons, such as the DPM in expression (23.8), one can only increase the variance 
relative to the mean. Hence, mixtures of Poissons are only appropriate for modeling 
overdispersed count distributions and produce poor results in the toxicology data on 
implantations. In particular, the estimated dose group-specific distributions of the 
number of implants under the DPM of Poissons exhibit substantially larger variance 
than the empirical distributions, suggesting a poor fit. 
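
The restriction to overdispersion follows from the laws of total expectation and total variance. As a short worked step, suppose $y \mid \theta \sim \text{Poisson}(\theta)$ with $\theta \sim P$ for any mixing distribution $P$ with finite variance:

```latex
\begin{aligned}
\mathrm{E}(y) &= \mathrm{E}\{\mathrm{E}(y \mid \theta)\} = \mathrm{E}(\theta), \\
\mathrm{var}(y) &= \mathrm{E}\{\mathrm{var}(y \mid \theta)\} + \mathrm{var}\{\mathrm{E}(y \mid \theta)\}
               = \mathrm{E}(\theta) + \mathrm{var}(\theta) \;\ge\; \mathrm{E}(y).
\end{aligned}
```

Equality holds only when $P$ is a point mass, so no mixing distribution can reproduce a variance of 6.8 alongside a mean of 12.5, as seen in the control group.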
For continuous data, Gaussian kernels are routinely used and avoid this pitfall because
they have separate parameters for the mean and variance. Gaussians are easily modified
to the count case by relying on rounding. In particular, let


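$$y_i = h(y_i^*), \quad y_i^* \sim N(\mu_i, \tau_i^{-1}), \quad (\mu_i, \tau_i) \sim P, \quad i = 1, \ldots, n, \quad P \sim \text{DP}(\alpha P_0), \qquad (23.9)$$

with $h(\cdot)$ a rounding function that has $h(y^*) = j$ if $y^* \in (a_j, a_{j+1}]$ for $j = 0, 1, 2, \ldots, \infty$,
$a_0 = -\infty$, $a_j = j - 1$, $j = 1, \ldots, \infty$, and $P_0(\mu, \tau) = N(\mu \mid \mu_0, \kappa\tau^{-1})\,\text{Gamma}(\tau \mid a_\tau, b_\tau)$.
For this rounded Gaussian kernel, we can derive a simple blocked Gibbs sampler,
which has the following steps.

1. Update $S_i \in \{1, \ldots, N\}$ by multinomial sampling with

$$\Pr(S_i = c \mid -) = \frac{\pi_c\, p(y_i \mid \mu_c^*, \tau_c^*)}{\sum_{c'=1}^{N} \pi_{c'}\, p(y_i \mid \mu_{c'}^*, \tau_{c'}^*)}, \quad c = 1, \ldots, N,$$

where $p(y_i \mid \mu_c^*, \tau_c^*) = \Phi(a_{y_i+1} \mid \mu_c^*, \tau_c^*) - \Phi(a_{y_i} \mid \mu_c^*, \tau_c^*)$, and $\Phi(z \mid \mu, \tau)$ is the normal
cumulative distribution function with location $\mu$ and precision $\tau$.

2. Update stick-breaking weight $V_c$, $c = 1, \ldots, N - 1$, from

$$\text{Beta}\left(1 + n_c,\; \alpha + \sum_{c'=c+1}^{N} n_{c'}\right).$$

3. Generate each $y_i^*$ from the full conditional posterior

$$u_i \sim U\left(\Phi(a_{y_i} \mid \mu_{S_i}^*, \tau_{S_i}^*),\; \Phi(a_{y_i+1} \mid \mu_{S_i}^*, \tau_{S_i}^*)\right), \quad y_i^* = \Phi^{-1}(u_i \mid \mu_{S_i}^*, \tau_{S_i}^*).$$

4. Update $\theta_c^* = (\mu_c^*, \tau_c^*)$, $c = 1, \ldots, N$, from its conditional posterior,

$$N(\mu_c^* \mid \hat{\mu}_c, \hat{\kappa}_c \tau_c^{*-1})\,\text{Gamma}(\tau_c^* \mid \hat{a}_{\tau_c}, \hat{b}_{\tau_c}),$$

with $\hat{a}_{\tau_c} = a_\tau + \frac{n_c}{2}$, $\hat{b}_{\tau_c} = b_\tau + \frac{1}{2}\left(\sum_{i: S_i = c} (y_i^* - \bar{y}_c^*)^2 + \frac{n_c}{1 + \kappa n_c} (\bar{y}_c^* - \mu_0)^2\right)$,
$\hat{\kappa}_c = (\kappa^{-1} + n_c)^{-1}$ and $\hat{\mu}_c = \hat{\kappa}_c(\kappa^{-1}\mu_0 + n_c \bar{y}_c^*)$, where $\bar{y}_c^*$ denotes the mean of the latent
$y_i^*$ allocated to cluster $c$.

Essentially, we just impute the latent $y_i^*$ within the third step of the Gibbs sampler
and otherwise proceed as if we were modeling the data using a DPM location-scale
mixture of Gaussians. In fact, the above steps can also be used for Bayesian density
estimation of continuous densities in which the observed data are $y^*$ and we have no
need for step 3.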
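
The rounded kernel and the latent-variable step can be sketched with the standard library's `NormalDist`. This is a minimal illustration under stated assumptions: the function names are hypothetical, the values $\mu = 3$, $\tau = 1$ are arbitrary, and the clipping of $u$ away from 0 and 1 is a numerical guard added for the example.

```python
import math
import random
from statistics import NormalDist

def threshold(j):
    """a_0 = -infinity and a_j = j - 1 for j >= 1."""
    return -math.inf if j == 0 else j - 1.0

def rounded_gaussian_pmf(y, mu, tau):
    """p(y | mu, tau) = Phi(a_{y+1} | mu, tau) - Phi(a_y | mu, tau)."""
    dist = NormalDist(mu, tau ** -0.5)  # precision tau means sd = tau^(-1/2)
    lower = 0.0 if y == 0 else dist.cdf(threshold(y))
    return dist.cdf(threshold(y + 1)) - lower

def impute_latent(y, mu, tau, rng):
    """Draw y* from N(mu, 1/tau) truncated to (a_y, a_{y+1}] by inverse-cdf sampling."""
    dist = NormalDist(mu, tau ** -0.5)
    lower = 0.0 if y == 0 else dist.cdf(threshold(y))
    upper = dist.cdf(threshold(y + 1))
    u = rng.uniform(lower, upper)
    # inv_cdf needs 0 < u < 1; clipping only matters when the cluster sits far from y.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return dist.inv_cdf(u)

rng = random.Random(3)
latent = [(y, impute_latent(y, mu=3.0, tau=1.0, rng=rng))
          for y in range(7) for _ in range(50)]
total_mass = sum(rounded_gaussian_pmf(y, 3.0, 1.0) for y in range(60))
```

Each imputed $y_i^*$ lies in the interval that rounds back to the observed count, so the remaining steps can treat the $y_i^*$ as continuous data.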

We repeated our analysis of the toxicology data on implantations using the DPM of 
rounded Gaussians approach, and obtained an excellent fit to the data, improving 
on the DPM of Poissons result. The empirical cumulative distribution functions are 
entirely enclosed within pointwise 95% credible intervals. To conduct inferences on 
changes in the distribution of the number of implants with dose, we estimated sum- 
maries of the posterior distributions for changes in each percentile between the control 
group and each of the exposed groups. Negative changes for an exposed group relative 
to control suggest an adverse impact of dose. The estimated posterior probability
of a negative average change across the percentiles was 0.72 in the 750 mg/kg group,
0.99 in the 1500 mg/kg group, and 0.94 in the 3000 mg/kg group. Hence, there was 
substantial evidence of a stochastic decrease in the number of implants in the higher 
two dose groups relative to control. 


23.4 Beyond density estimation 
Nonparametric residual distributions 


Density estimation has been used to this point primarily to simplify the exposition of a 
difficult topic. The real attraction of Dirichlet process mixture (DPM) models is that 
they can be used much more broadly for relaxing parametric assumptions in hierarchical 
models. This section is meant to give a flavor of some of the possibilities without being 
comprehensive. First, consider the linear regression setting with a nonparametric error 
distribution: 

$$y_i = X_i\beta + \epsilon_i, \quad \epsilon_i \sim f, \qquad (23.10)$$

where $X_i = (X_{i1}, \ldots, X_{ip})$ is a vector of predictors and $\epsilon_i$ is an error term with distribution
$f$. The assumption of linearity in the mean is easily relaxed as discussed earlier. Here, we
consider the problem of relaxing the assumption that $f$, the distribution of errors, has a
parametric form.

In Chapter 17 we considered the $t$ model as a way to downweight the influence of outliers.
This is easily accomplished computationally by expressing the $t_\nu$ distribution as a scale
mixture of normals by letting $\epsilon_i \sim N(0, \phi_i^{-1}\sigma^2)$, with $\phi_i \sim \text{Gamma}(\nu/2, \nu/2)$. Although
the $t$ distribution may be preferred over the normal due to its heavy tails, it still has a
restrictive shape and we could instead model $f$ nonparametrically using a DP scale mixture
of normals:

$$\epsilon_i \sim N(0, \phi_i^{-1}), \quad \phi_i \sim P, \quad P \sim \text{DP}(\alpha P_0),$$

where $P_0$ is chosen to correspond to $\text{Gamma}(\nu/2, \nu/2)$ to center the prior for $f$ on a $t$
distribution, while allowing more flexibility. The resulting prior for $f$ is flexible but is still
restricted to be unimodal and symmetric about zero.

An alternative which removes the unimodality and symmetry constraints is to use a
location mixture of Gaussians for $f$. Removing the intercept from the $X_i\beta$ term and allowing
$f$ to have an unknown mean, let

$$\epsilon_i \sim N(\mu_i, \tau^{-1}), \quad \mu_i \sim P, \quad P \sim \text{DP}(\alpha P_0), \quad \tau \sim \text{Gamma}(a_\tau, b_\tau),$$

with $P_0$ chosen as $N(0, \tau^{-1})$. The computations for density estimation can be easily adapted
to include steps for updating the regression coefficients $\beta$ and then replacing $y_i$ with $y_i - X_i\beta$
in the previous steps.

Nonparametric models for parameters that vary by group


In Chapter 15 we considered hierarchical linear models with varying coefficients. Uncer- 
tainty about the distribution of the coefficients can be taken into account by placing DP or 
DPM priors on their distributions. As a simple illustration, consider the one-factor Anova 
model, 


$$y_{ij} = \mu_i + \epsilon_{ij}, \quad \mu_i \sim f, \quad \epsilon_{ij} \sim g, \qquad (23.11)$$

with $y_i = (y_{i1}, \ldots, y_{in_i})$ a vector of repeated measurements for item $i$, $\mu_i$ a subject-specific
mean, and $\epsilon_{ij}$ an observation-specific residual. Typical parametric models would let $f$
correspond to a $N(\mu, \psi^{-1})$ density, while letting $g = N(0, \sigma^2)$.

To allow more flexibility in characterizing variability among subjects, we can instead let 


$$\mu_i \sim P, \quad P \sim \text{DP}(\alpha P_0), \qquad (23.12)$$

where $P$ is the unknown distribution of the varying parameters and for simplicity we model
the residual density $g$ as $N(0, \sigma^2)$. Placing a DP prior on the distribution induces a latent
class model in which the subjects are grouped into an unknown number of clusters, with

$$\mu_i = \mu_{S_i}^*, \quad \Pr(S_i = h) = \pi_h, \quad h = 1, 2, \ldots, \qquad (23.13)$$

where $S_i \in \{1, \ldots, \infty\}$ is a latent class index, and $\pi_h$ is the probability of allocation to
latent class $h$, with these probabilities following the stick-breaking form as in (23.3). As
for finite latent class models, this formulation assumes that the distribution of the varying
parameters is discrete so that different subjects can have identical values of the parameters.
This may be useful as a simplifying assumption, and the posterior means will still be different
for every subject, since the clustering is soft and probabilistic, with the posterior means of
$\mu_i$ obtained by averaging across the posterior distribution on the cluster allocation.

There are some practical questions that arise in considering nonparametric hierarchical
models such as (23.12). The first is whether the data contain information to allow
nonparametric estimation of $P$ given that the modeled parameters $\mu_i$ are not observed directly
for any of the subjects. The answer to this question and the interpretation of the resulting
estimate depend on the number of observations per subject. Suppose initially that $n_i = 1$
for all subjects. In this case, we do not have any information in the data to distinguish
variability among subjects from variability among measurements within a subject. However,
under the assumption of normality of the residual density $g$, we still have substantial
information in the data about $P$ in that $P$ accommodates lack of fit of the normal residual
distribution. In the general case in which $n_i > 1$ and normal $g$ is assumed, $P$ has a dual
role in allowing for lack of fit of the normal distribution for the residuals and systematic
variability among subjects. When there are many observations per subject, the latter role
dominates, but when $n_i$ is small, interpretation of $P$ needs to take into account the dual
roles.

One natural possibility for removing this confounding is to also model g using a Dirichlet 
process mixture of Gaussians. In this case, the data contain less information about the 
distribution, and accurate estimation may require a data set with many observations per 
item and many items. In the case in which the distribution of the parameters and the 
residual distribution are both unknown, an identifiability issue does arise in that it is difficult 
in nonparametric Bayes models to restrict the mean of the distribution to be zero. However, 
one can run the MCMC analysis for an overparameterized model without restrictions on the 
means and then post-process to estimate the overall mean and mean-centered parameters 
and residual densities.

Functional data analysis


In Chapter 21 we discussed Gaussian processes for functional data analysis, where responses 
and predictors for a subject are not modeled as scalar or vector-valued random variables 
but instead as random functions defined at infinitely many points. Here we consider a basis 
function expansion related to the approaches considered in Chapter 20. 

Let $y_i = (y_{i1}, \ldots, y_{in_i})$ denote the observations on function $f_i$ for subject $i$, where $y_{ij}$ is
an observation at point $t_{ij}$, with $t_{ij} \in \mathcal{T}$. Allowing for measurement errors in observations
of a smooth trajectory, let

$$y_{ij} \sim N(f_i(t_{ij}), \sigma^2), \qquad (23.14)$$

and

$$f_i(t) = \sum_{h=1}^{H} \theta_{ih} b_h(t), \quad \theta_i = (\theta_{i1}, \ldots, \theta_{iH}),$$

where $b = \{b_h\}_{h=1}^{H}$ is a collection of basis functions and $\theta_i$ is a vector of subject-specific
basis coefficients. Here, we are assuming a common collection of potential basis functions,
but by allowing elements of the $\theta_i$ coefficient vectors to be zero or close to zero, we can
discard unnecessary basis functions and even accommodate subject-specific basis function
selection. In many applications, it is necessary to allow different subjects to have a different
basis for sufficient flexibility. By using a common dictionary of bases across subjects, we
allow a common backbone from which a hierarchical model for borrowing information can
be built.

To borrow information across subjects and model the variability in the individual
functions, let

$$\theta_i \sim P,$$

where the $H$-dimensional distribution $P$ must be specified or modeled. Potentially, we can
consider a parametric family in which $P = N_H(\theta, \Omega)$, with the resulting mean function then
corresponding to $f(t) = b(t)\theta$, where $b(t) = (b_1(t), \ldots, b_H(t))$. This mean function provides
a population-averaged curve. In addition, the hierarchical covariance matrix $\Omega$ characterizes
heterogeneity among the subjects in their functions. There are several practical issues that
arise with this parametric hierarchical model. Firstly, the number of basis functions $H$ is
typically moderate to large, and hence $\Omega$ will have many parameters and it can be difficult to
reliably estimate all these parameters. In addition, there is no allowance for basis function
selection through shrinking the basis coefficients to zero. Although one could potentially
choose a prior for $\Omega$ that allows diagonal elements close to zero, this would discard the
corresponding basis functions for all subjects and does not accommodate subject-specific
selection. Finally, the normality assumption for the varying parameters implies a restrictive
type of variability across subjects; for example, it cannot accommodate sub-populations of
subjects having different functions and outlying functions.

An alternative is to use a Dirichlet process prior, $P \sim \text{DP}(\alpha P_0)$. This will induce
functional clustering with

$$f_i(t) = f_{S_i}^*(t), \quad f_c^*(t) = b(t)\theta_c^*, \quad \Pr(S_i = c) = \pi_c, \quad \theta_c^* \sim P_0.$$

All individuals within cluster $c$ will have $f_i(t) = f_c^*(t)$, with the basis coefficients
characterizing the cluster $c$ function being $\theta_c^* = (\theta_{c1}^*, \ldots, \theta_{cH}^*)$. By choosing an appropriate base
measure $P_0$ within the functional DP, one can allow the basis functions to differ across
the clusters and hence allow individual-specific basis selection through cluster-specific basis
selection.

There are two good possibilities in this regard. Firstly, one can let $P_0 = \bigotimes_{h=1}^{H} P_{0h}$, with
$P_{0h}$ specified to have a variable selection-type mixture form:

$$P_{0h}(\cdot) = \pi_{0h}\delta_0(\cdot) + (1 - \pi_{0h}) N(\cdot \mid 0, \psi_h^{-1}),$$

possibly with $\psi_h \sim \text{Gamma}(\nu/2, \nu/2)$ to induce a heavy-tailed $t$ prior for the nonzero basis
coefficients. In sampling the cluster-specific basis coefficients from this prior, $\theta_{ch}^* \sim P_{0h}$
independently for $h = 1, \ldots, H$, a subset of the elements of $\theta_c^*$ will be exactly equal to 0, with
this subset varying across the different functional clusters. By letting $\pi_{0h} \sim \text{Beta}(a, b)$, one
can allow uncertainty in the prior probability of exclusion of the $h$th basis function, while
borrowing information across functional clusters in learning about which basis functions are
needed overall.

This approach leads to a straightforward Gibbs sampler, incorporating minor modifica- 
tions to the blocked Gibbs sampling or finite Dirichlet approximation algorithms described 
above. After allocating subjects to clusters, we can update $\theta_{ch}^*$ by direct sampling from its
full conditional distribution, which will have a mixture form consisting of a point mass at 
zero and a normal distribution. This has a similar form to that described in Chapter 20 for 
nonparametric regression with basis selection. Each pass through the Gibbs updating steps 
will vary the allocation of subjects to functional clusters and the basis functions selected to 
characterize the functional trajectories. Based on the resulting posterior samples, we can 
obtain model-averaged estimates of the functions specific to each subject. 

Although this approach tends to work well in practice, the one-at-a-time updates of the
$\theta_{ch}^*$s specific to each cluster and basis function can lead to a high computational burden
for each Gibbs sweep as well as slow mixing of the chains. An alternative
that leads to similar estimation performance in practice, while simplifying and substantially
speeding up posterior computation, is to use a heavy-tailed shrinkage prior in place of the
variable selection mixture prior. In particular, one simple choice is

$$\theta_{ch}^* \sim N(0, \psi_{ch}^{-1}), \quad \psi_{ch} \sim \text{Gamma}(\nu/2, \nu/2),$$

with $\nu$ chosen to be small; for example, $\nu = 1$ provides a default that leads to a Cauchy
marginal prior for the basis coefficients. Under this approach, the basis coefficient vectors $\theta_c^*$
can be updated in a block from multivariate normal full conditional posterior distributions.
This block updating can accelerate mixing substantially. Although this approach does not 
formally allow for basis selection in that none of the coefficients will be exactly zero, the 
prior is concentrated at zero with heavy tails and hence allows coefficients that are close to 
zero. For small basis coefficients, the corresponding basis functions are effectively excluded 
and in practice it is impossible to distinguish small nonzero coefficients from coefficients 
that are exactly zero. In most applications, it can be argued that coefficients will not be 
exactly zero in any case. 
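
The shrinkage prior is easy to simulate, which gives a quick check of the Cauchy claim. The sketch below is illustrative only: the function name is hypothetical and the gamma distribution is taken to be parameterized by shape and rate. For $\nu = 1$ the marginal is standard Cauchy, whose absolute value has median 1.

```python
import random
from statistics import median

def draw_shrinkage_coefficient(nu, rng):
    """theta ~ N(0, 1/psi) with psi ~ Gamma(nu/2, rate nu/2); marginally t_nu."""
    psi = rng.gammavariate(nu / 2.0, 2.0 / nu)  # (shape, scale = 1 / rate)
    return rng.gauss(0.0, psi ** -0.5)

rng = random.Random(7)
draws = [draw_shrinkage_coefficient(1.0, rng) for _ in range(20000)]
median_abs = median(abs(d) for d in draws)  # should be close to 1 for nu = 1
```

Most draws are small in magnitude while a few are very large, which is the combination of mass near zero and heavy tails described above.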


23.5 Hierarchical dependence 


In the previous sections we discussed Dirichlet process mixture models for a single unknown 
distribution. This unknown distribution can be the distribution of the data directly or some 
component of a hierarchical Bayesian model. To build rich semiparametric hierarchical 
models, one may potentially incorporate several DPs, set to be independent in the prior 
distribution. This approach was implemented in the toxicology data analysis in the previous 
section. However, there are clear limits to this strategy and in many settings it is appealing 
to use priors that favor dependence in unknown distributions. To motivate the need for such 
generalizations, we start by describing an application in which such flexibility is warranted. 


Example. A genotoxicity application 

Suppose we have an experimental study in which observations are taken for ‘subjects’
in different groups on a continuous response variable. In particular, let $y_i$ denote the
response for subject $i$ and let $x_i \in \{1, \ldots, T\}$ denote the group. For example, in
genotoxicity studies utilizing single cell gel electrophoresis (also known as the ‘comet
assay’) to measure DNA damage, $y_i$ denotes a measure of the amount of DNA damage
in cell $i$ and $x_i$ denotes the dose group of a potentially genotoxic exposure. The
emphasis of such studies is on assessing how the density of DNA damage across cells
changes with dose. Figure 23.4 shows histograms and kernel-smoothed density estimates
within each dose group for data collected in a genotoxicity study in which the
dose groups correspond to different levels of exposure to hydrogen peroxide, a known
genotoxic chemical, and DNA damage is measured on the individual cell level using
the comet assay. It is apparent from the figure that the lower quantiles of the response
density do not change much at all with level of exposure, while the higher quantiles
increase dramatically with increasing dose. This is as expected due to variability among
the cells and to the fact that it is not possible experimentally to get the same dose to
all cells.

Figure 23.4 Histograms and kernel-smoothed density estimates of DNA damage across cells in each
hydrogen peroxide dose group in the genotoxicity example.

This tendency for the upper quantiles of a response density to be more sensitive to an 
exposure is certainly not unique to genotoxicity studies and is a natural consequence of 
variability among experimental units in their sensitivity to exposure. In addition, even 
beyond toxicology and epidemiology applications assessing the impact of exposures, 
it is common in broad applications to observe differential changes in the different 
quantiles with predictors. From an applied perspective, a fundamental question is 
how to model such data. A standard parametric modeling approach would let 


$$y_i \sim N\left(\mu + \sum_{j=2}^{T} 1_{x_i = j}\,\beta_j,\; \sigma^2\right),$$

where $\mu$ is the expected measure of damage for cells in the unexposed control group
for which $x_i = 1$, $\beta_j$ is the shift in the mean response for cells in group $j$, and $\sigma^2$
is a common variance parameter. This model falls short in assuming the response is
normally distributed within each group and only allowing the mean to shift.


Dependent Dirichlet processes 


A Dirichlet process provides a prior for a single random probability measure, $P \sim \text{DP}(\alpha P_0)$.
Focusing on the comet assay application and on modeling the density of DNA damage
within the $j$th group, a natural approach would be to use the DP location-scale mixture of
Gaussians,

$$\begin{aligned}
f_j(y) &= \int N(y \mid \mu, \phi^{-1})\, dP_j(\mu, \phi), \quad P_j \sim \text{DP}(\alpha_j P_{0j}) \\
       &= \sum_{h=1}^{\infty} \pi_{jh} N(y \mid \mu_{jh}^*, \phi_{jh}^{*-1}), \quad \pi_j \sim \text{stick}(\alpha_j), \quad (\mu_{jh}^*, \phi_{jh}^*) \sim P_{0j}.
\end{aligned}$$

Potentially one can define separate DPMs of Gaussians for each dose group, but the question
is then how to borrow information.

An alternative and more general strategy is to define a dependent Dirichlet process
(DDP) prior for the collection of dependent random probability measures $\{P_1, \ldots, P_T\}$. A
DDP is an extremely broad class of priors for collections of random probability measures
having the defining property that the marginal priors for each random probability measure
in the collection are DPs. Hence $\{P_1, \ldots, P_T\} \sim \text{DDP}$ implies that $P_j \sim \text{DP}(\alpha_j P_{0j})$
for $j = 1, \ldots, T$. In defining DDPs, it is most convenient to rely on a stick-breaking
representation and let

$$P_j = \sum_{h=1}^{\infty} \pi_{jh}\delta_{\theta_{jh}^*}, \quad \pi = \{\pi_{jh}\} \sim Q, \quad \theta^* = \{\theta_{jh}^*\} \sim R,$$

with $Q$ and $R$ chosen so that $\pi_j = (\pi_{j1}, \pi_{j2}, \ldots) \sim \text{stick}(\alpha_j)$ and $\theta_{jh}^* \sim P_{0j}$ marginally for
$j = 1, \ldots, T$.

It can be complicated to define a $Q$ that leads to dependent stick-breaking processes
having the correct marginals, and hence most of the literature has focused on so-called
‘fixed-$\pi$ DDPs’ which let

$$P_j = \sum_{h=1}^{\infty} \pi_h \delta_{\theta_{jh}^*}, \quad \pi \sim \text{stick}(\alpha), \quad \theta_h^* = (\theta_{1h}^*, \ldots, \theta_{Th}^*) \sim P_0,$$

so that the probability weights on the different components are identical across groups and
only the atom locations vary, with dependence in the atom locations controlled by the choice
of $P_0$.

Returning to the motivating comet assay application, suppose we use a fixed-$\pi$ DDP
mixture of Gaussian kernels with

$$f_j(y) = \sum_{h=1}^{\infty} \pi_h N(y \mid \mu_{jh}^*, \phi_h^{*-1}), \quad \pi \sim \text{stick}(\alpha), \quad \phi_h^* \sim \text{Gamma}(a, b),$$

so that the weights and bandwidths are identical across dose groups, but the locations of
the kernels differ. It remains to specify a joint prior for the group-specific kernel locations,

$$(\mu_{1h}^*, \ldots, \mu_{Th}^*) \sim P_0.$$

To favor similarities between adjacent dose groups in the unknown density of DNA damage,
we can choose a prior that favors $\mu_{jh}^* \approx \mu_{j+1,h}^*$. One computationally convenient and flexible
choice corresponds to drawing $\mu_{1h}^*$ from a Gaussian prior and then letting

$$\mu_{j+1,h}^* - \mu_{jh}^* = \beta_{jh} \sim \pi_0\delta_0 + (1 - \pi_0) N(0, c), \qquad (23.16)$$

so that the shift in kernel locations for adjacent dose groups is zero with probability $\pi_0$ and
is otherwise sampled from a Gaussian prior. To instead enforce a non-decreasing stochastic
order constraint on the densities across dose groups, one can replace this normal with a
normal truncated below by zero. Clearly, increasing values of $\pi_0$ will favor
a high proportion of identical kernels and effective pooling of the dose groups. This leads
to improved efficiency in estimating the dose group-specific densities and is of interest in
testing hypotheses of near equality among the groups. Alternatively, it may be convenient to
use a heavy-tailed mixture prior for the $\beta_{jh}$s.
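
A draw from the prior in (23.16) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the $N(0, \texttt{mu\_sd}^2)$ prior for the first-group location, and the reading of $c$ as a variance are hypothetical choices for the example. The `monotone` option uses the fact that the absolute value of an $N(0, c)$ draw follows that normal truncated below by zero.

```python
import random

def draw_kernel_locations(T, N, pi0, c, rng, mu_sd=1.0, monotone=False):
    """Return locations[h][j]: the location of kernel h in dose group j (0-based)."""
    locations = []
    for _ in range(N):
        mu = rng.gauss(0.0, mu_sd)  # location in the first group (assumed prior)
        row = [mu]
        for _ in range(T - 1):
            if rng.random() < pi0:
                beta = 0.0  # point mass at zero: adjacent groups share this kernel
            else:
                beta = rng.gauss(0.0, c ** 0.5)  # c is treated as a variance
                if monotone:
                    beta = abs(beta)  # N(0, c) truncated below at zero
            mu += beta
            row.append(mu)
        locations.append(row)
    return locations

rng = random.Random(11)
pooled = draw_kernel_locations(T=4, N=5, pi0=1.0, c=1.0, rng=rng)
ordered = draw_kernel_locations(T=4, N=5, pi0=0.5, c=1.0, rng=rng, monotone=True)
```

With $\pi_0 = 1$ every kernel is shared by all groups, so the group densities coincide. With the truncated option each kernel's location is non-decreasing in dose, which yields the stochastic ordering.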

Although this DDP mixture model may initially seem complicated, it is actually a 
simple modification of the DPM model for a single density and posterior computation can 
proceed along similar lines. For example, if one uses a blocked Gibbs sampler, the steps are 
essentially identical to those described in the previous chapter. Letting $y_i \sim N(\mu_i, \phi_i^{-1})$,
the steps proceed as follows: 


1. Update mixture component (cluster) index $S_i$ by sampling from a multinomial
conditional posterior:

$$\Pr(S_i = h \mid -) = \frac{\pi_h N(y_i \mid \mu_{x_i h}^*, \phi_h^{*-1})}{\sum_{l=1}^{N} \pi_l N(y_i \mid \mu_{x_i l}^*, \phi_l^{*-1})}, \quad h = 1, \ldots, N,$$

with $S_i = h$ denoting that $\mu_i = \mu_{x_i h}^*$ and $\phi_i = \phi_h^*$.

2. Update the stick-breaking weights from beta full conditionals as in the previous chapter.

3. Update the kernel-specific precisions $\phi_h^*$ from gamma full conditional posteriors

$$(\phi_h^* \mid -) \propto \text{Gamma}(\phi_h^* \mid a, b) \prod_{i: S_i = h} N(y_i \mid \mu_{x_i h}^*, \phi_h^{*-1}).$$

4. Update $\mu_{1h}^*$ from its Gaussian full conditional and $\beta_{jh}$ from its full conditional, which is
a mixture of a point mass at zero and a Gaussian.

We leave it to the reader as a straightforward algebraic exercise to derive the specific
conditionals in steps 3-4.


Example. Genotoxicity application (continued) 

We consider the application introduced on page 560 of a study of DNA damage and
repair in relation to exposure to H2O2 (hydrogen peroxide). Batches of cells were
exposed to 0, 5, 20, 50, or 100 μmol of H2O2, with DNA damage measured in individual
cells after allowing a repair time of 0, 60, or 90 minutes. With $i = 1, \ldots, n$ indexing
the cells under study, the measured response $y_i$ for cell $i$ was the Olive tail moment,
which is a surrogate of the frequency of DNA strand breaks obtained using the comet
assay.

The goal of the study is to assess the sensitivity of the comet assay in detecting
damage induced by the known genotoxic agent H2O2, while also investigating how
rapidly damage is repaired. Let $x_i \in \{1, \ldots, K\}$ be a group index denoting the level
of H2O2 and repair time for cell $i$. The value of $x_i$ for each dose × repair time combination
is shown in Figure 23.5, along with the known stochastic ordering restrictions among
the groups. As repair time increases within a dose group, the DNA damage density
is expected to decrease stochastically, while as dose increases with zero repair time,
the DNA damage density should increase. The sample size is 1400, with 100 cells per
group except for groups 9 and 13, which had 50.

Figure 23.5 Directed graph illustrating order restriction in the genotoxicity model. Arrows point
toward stochastically larger groups. Posterior probabilities of $H_{1k}$ are also shown.

We wish to assess whether DNA damage increases with H2O2 dose, and whether damage
is significantly reduced across each increment of repair time. We use a restricted
DDP to model these data. The DNA damage density within each group is characterized
as a Dirichlet process location mixture of Gaussians. We parameterize in terms
of adjacent group differences, and use a prior similar to (23.16) but modified to restrict
the cluster-specific mean differences to obey the ordering in Figure 23.5. The
$k$th directed edge in Figure 23.5 links two unknown densities that are characterized
as DPMs of Gaussians, with identical weights and kernel bandwidths but potentially
different locations. Let $d_k$ denote the total probability weight on clusters (mixture
components) that differ between the groups linked by the $k$th edge. If $d_k$ is small, it
implies that the two densities are similar, providing a simple scalar summary.

Figure 23.5 shows posterior probabilities of local orderings for group comparisons 
corresponding to each edge. Simulation studies of frequentist operating characteristics 
suggest that this testing procedure has excellent frequentist performance in terms of 
low type I error rate and high power even in small samples with subtle differences 
between groups. Our results give strong evidence of increases in DNA damage between 
the 0, 5, and 20 μmol H2O2 dose groups given a repair time of 0 min, with the evidence
of further increases at higher dose levels less clear. As expected, there is no evidence 
of a change in distribution between groups 1, 6 and 11, since there was no induced 
damage to be repaired. However, there are clear decreases in DNA damage in each of 
the exposed groups after a repair time of 60 min. Allowing an additional 30 min of 
repair did not significantly alter the distribution. 

These results are consistent with subjective examination of the raw data, and the
Bayesian nonparametric density estimates shown in Figure 23.6 are consistent with
frequentist kernel-smoothed density estimates obtained for each group separately. The
approach borrows strength adaptively across groups. If the data support a similar
or identical distribution for adjacent groups in the graph, these groups are effectively
pooled, yielding a high posterior probability of $H_{0k}$ and densities that are close to
identical. Such borrowing dramatically reduces mean squared error in estimating the
individual densities, while producing inferences on group comparisons.


Hierarchical Dirichlet processes 


Figure 23.6 Genotoxicity application. Estimated densities of the Olive tail moment in a subset of
the H2O2 dose × repair groups. Solid curves are the posterior mean density estimates and dashed
curves provide pointwise 95% credible intervals.

The fixed-$\pi$ DDP is of limited flexibility in characterizing hierarchical dependence structures
in unknown distributions. A widely useful alternative type of DDP, termed the hierarchical
DP (HDP), instead relies on letting

$$P_j \sim \text{DP}(\alpha P_0), \qquad P_0 \sim \text{DP}(\beta P_{00}),$$


which corresponds to choosing independent DP priors for each $P_j$ conditionally on an
unknown base measure $P_0$, which is in turn also assigned a DP. As shorthand notation, we
can let $P \sim \text{HDP}(\alpha, \beta, P_{00})$. From the stick-breaking construction, it follows that

$$P_j = \sum_{h=1}^{\infty} \pi_{jh}\, \delta_{\theta^*_h}, \qquad P_0 = \sum_{h=1}^{\infty} \lambda_h\, \delta_{\theta^*_h}, \qquad \theta^*_h \sim P_{00}.$$


Hence, because realizations from a DP are discrete, assigning a DP prior to the common base
measure $P_0$ implies that all the group-specific random probability measures use identical
atoms, while the weights are allowed to deviate across the groups. This leads to a different
structure from the fixed-$\pi$ DDP, which allows the atoms
to vary while using the same weights.

To illustrate this structure, suppose that

$$f_j(y) = \int N(y \mid \mu, \phi^{-1})\, dP_j(\mu, \phi), \qquad P \sim \text{HDP}(\alpha, \beta, P_{00}).$$

In this case, the model introduces a common global dictionary of normal kernels with varying
locations and scales,

$$N(\mu^*_h, \phi_h^{*-1}), \qquad h = 1, \ldots, \infty.$$

There is a central density $f_0(y)$, which is characterized by mixing over this dictionary with
weights $\lambda$, and the group-specific densities are expressed using the same dictionary but
with weights $\pi_j$ drawn from a stick-breaking process centered on $\lambda$. The hyperparameter $\alpha$
controls the variability across groups in the weights, with $\alpha \to 0$ implying that the
group-specific densities are Gaussian, with clusters of groups sharing the same mean and precision
in the Gaussian kernel. At the other extreme, when $\alpha \to \infty$, one obtains pooling across
groups with $f_j = f_0$ and $f_0$ modeled as a DPM location-scale mixture of Gaussians. The
recycling of the same atoms across groups, within a simple structure that accommodates
variability in the weights, has led to broad impact of the HDP in numerous applications.
Computation is also tractable.
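The shared-dictionary structure can be made concrete with a truncated simulation. The sketch below is illustrative only: it truncates the central measure to a fixed number of atoms, so that the group-specific weights given the central weights follow a finite Dirichlet distribution with parameters equal to $\alpha$ times the central weights, and it uses a hypothetical normal-gamma style base measure for the atoms. All function names and the particular base measure are assumptions.

```python
import random

def stick_breaking(conc, n_atoms, rng):
    """Truncated stick-breaking weights; the final stick takes the leftover mass."""
    weights, remaining = [], 1.0
    for h in range(n_atoms):
        v = 1.0 if h == n_atoms - 1 else rng.betavariate(1.0, conc)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights

def group_weights(alpha, lam, rng):
    """Group weights given central weights lam: Dirichlet(alpha * lam) via gamma draws."""
    draws = [rng.gammavariate(alpha * l, 1.0) if alpha * l > 0.0 else 0.0 for l in lam]
    total = sum(draws)
    if total == 0.0:
        # All draws underflowed (very small alpha): use the alpha -> 0 limit,
        # a point mass on one atom chosen with probabilities lam.
        chosen = rng.choices(range(len(lam)), weights=lam)[0]
        return [1.0 if h == chosen else 0.0 for h in range(len(lam))]
    return [d / total for d in draws]

def sample_truncated_hdp(n_groups, alpha, beta, n_atoms=20, seed=0):
    rng = random.Random(seed)
    # Shared dictionary of (location, precision) atoms from a hypothetical base measure.
    atoms = [(rng.gauss(0.0, 3.0), rng.gammavariate(2.0, 1.0)) for _ in range(n_atoms)]
    lam = stick_breaking(beta, n_atoms, rng)              # weights of the central measure
    pis = [group_weights(alpha, lam, rng) for _ in range(n_groups)]
    return atoms, lam, pis
```

Every group uses the same `atoms`; only the rows of `pis` differ. Large `alpha` pulls each row toward `lam`, while small `alpha` concentrates each row on a single atom, matching the two limits described above.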

Another implication of the hierarchical Dirichlet process is hierarchical clustering. To
motivate this, we focus on an application in which $i = 1, \ldots, n$ indexes states in the US,
$j = 1, \ldots, n_i$ indexes hospitals within state $i$, and $y_{ij}$ denotes a continuous measure of quality
of care for hospital $j$ in state $i$. Supposing we let $y_{ij} \sim f_i$ and assign an HDP location
mixture of Gaussians prior for the collection of state-specific quality of care densities $\{f_i\}$,
we obtain the following induced hierarchical model:


Yij ~ N(uS,,, 05) 
Pr(Sij = h) = Tih, (uz, A) G Poo; 


where $S_{ij} = h$ denotes that hospital $j$ in state $i$ is assigned to quality of care cluster $h$. Due
to the hierarchical structure on the weights, it is more likely that hospitals within a state
will be assigned to the same cluster, but one can also obtain clustering of hospitals across
states. Such soft probabilistic clustering may be of interest in certain applications, and
also describes how the HDP borrows information across groups (in this case states).
Due to the critical role of the hyperparameters $\alpha$ and $\beta$ in controlling the within-group
dependence in clustering and the total number of clusters, respectively, it is important to
allow the data to inform about their values through choosing hyperpriors. A common choice
is $\alpha \sim \text{Gamma}(1, 1)$ independently of $\beta \sim \text{Gamma}(1, 1)$.


Nested Dirichlet processes 


The HDP works by incorporating the same atoms in the different group-specific distribu- 
tions while allowing variability in the weights, with this leading to dependent clustering 
of subjects across groups. In many applications, it is preferable to instead cluster groups 
having identical distributions. For example, in the comet assay application, we may cluster 
dose groups having no differences in the distribution of DNA damage, while in the hospital 
application, we may cluster states having no differences in the distributions of quality of 
care across hospitals. In the former case, one obtains a nonparametric Bayesian approach 
for multiple treatment group comparisons, with posterior probabilities obtained for equal- 
ities in each pair of dose groups as well as in all the dose groups. Such probabilities can 
form the basis of Bayesian testing of hypotheses about equalities in the treatment groups. 

To accomplish such distributional clustering, one can rely on a nested Dirichlet process
(NDP) mixture. The NDP can be expressed as follows:

$$P_j \sim P, \qquad P \sim \text{DP}(\alpha P_0), \qquad P_0 \equiv \text{DP}(\beta P_{00}). \qquad (23.17)$$

At first glance, this seems similar to the HDP, which would draw the group-specific random
probability measures from a DP with a DP prior on the base. However, here we instead
draw the $P_j$s from a common random probability measure, which is drawn from a DP with
the base being a DP instead of a realization from a DP. The distinction becomes clear when
we examine the stick-breaking representation of the NDP, which has the form:

$$P_j \sim P = \sum_{h=1}^{\infty} \pi_h\, \delta_{P^*_h}, \qquad \pi \sim \text{stick}(\alpha), \qquad P^*_h \sim \text{DP}(\beta P_{00}). \qquad (23.18)$$


Hence, $P$ takes the usual DP stick-breaking form but with the atoms corresponding to
random probability measures drawn iid from a DP. This leads to clustering of the random
probability measures in which the prior probability of $P_j = P_{j'}$ is $1/(1+\alpha)$, as an automatic
consequence of the DP Polya urn scheme. However, if $P_j$ and $P_{j'}$ are in different clusters,
they would have distinct atoms drawn independently from $P_{00}$. This is different from the
HDP, which formulates $P_j$ using a common set of atoms but with distinct weights, so that
$\Pr(P_j = P_{j'}) = 0$.

In practice, the $P_j$s are typically used as mixture distributions in NDP mixture models
for collections of group-specific densities. For example,

$$f_j(y) = \int N(y \mid \mu, \phi^{-1})\, dP_j(\mu, \phi), \qquad P \sim \text{NDP}(\alpha, \beta, P_{00}),$$

with $P \sim \text{NDP}(\alpha, \beta, P_{00})$ used as shorthand for the prior in (23.17)-(23.18). In this case,
the group-specific densities will be allocated to clusters, with $\Pr(f_j = f_{j'}) = 1/(1+\alpha)$. We can
choose a hyperprior distribution $\alpha \sim \text{Gamma}(a, b)$ with $a, b$ elicited to obtain desired values
for $\Pr(H_{0jj'})$ and $\Pr(H_0)$. This allows the data to inform about $\alpha$, and the data tend to
inform more strongly as the number of groups increases, leading to a so-called 'blessing of
dimensionality.'
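The prior probability that two groups share a distribution can be checked by simulating the Polya urn allocation of groups to distributional clusters. The following is a minimal sketch; the function names are hypothetical, and only the cluster labels (not the cluster-specific measures themselves) are simulated.

```python
import random

def sample_group_clusters(n_groups, alpha, rng):
    """Assign groups to distributional clusters by the DP Polya urn scheme."""
    labels, counts = [], []
    for j in range(n_groups):
        # Existing cluster h has mass counts[h]; a new cluster has mass alpha.
        u = rng.random() * (alpha + j)
        cumulative, chosen = 0.0, None
        for h, c in enumerate(counts):
            cumulative += c
            if u < cumulative:
                chosen = h
                break
        if chosen is None:               # open a new cluster
            chosen = len(counts)
            counts.append(0)
        counts[chosen] += 1
        labels.append(chosen)
    return labels

def tie_frequency(alpha, n_sims=20000, seed=0):
    """Monte Carlo estimate of the prior probability that two groups are clustered."""
    rng = random.Random(seed)
    ties = sum(1 for _ in range(n_sims)
               if len(set(sample_group_clusters(2, alpha, rng))) == 1)
    return ties / n_sims
```

For example, `tie_frequency(1.0)` should be close to 1/2 and `tie_frequency(3.0)` close to 1/4, in line with the $1/(1+\alpha)$ formula; this is one way to see how a choice of hyperprior on $\alpha$ translates into a prior on equality of two groups.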

A natural NDP-HDP modification that has been implemented successfully in the machine
learning literature is to place a DP on $P_{00}$ within (23.18) so that the cluster-specific
densities share a common set of global atoms but with varying weights. This combines the
NDP and HDP priors, potentially exploiting the advantages of both approaches.


Convex mixtures 


Both the HDP and NDP are special cases of the DDP framework in that they incorporate 
dependence in a collection of random probability measures while maintaining DP priors for 
the individual RPMs marginally. Although the DP has some appealing properties and is 
in some sense a canonical case, it certainly limits flexibility in modeling to always restrict 
attention to DDPs and to not consider broader classes of priors that incorporate dependence 
in a different manner without maintaining DP marginals. One alternative approach to in- 
ducing dependence is to use random convex combinations of component random probability 
measures. For example, suppose that interest focuses on combining data from longitudinal
studies conducted at different study centers following a closely related protocol. In particular,
let $y_{cij}$ denote the response for the $i$th individual from center $c$ at time $t_{cij}$, and let
$z_{cij}$ denote corresponding covariates (potentially including basis functions in time to allow
non-linear coefficients). Then, one may consider a hierarchical model such as

$$y_{cij} \sim N(z_{cij}\beta_{ci} + \epsilon_{cij}, \sigma^2),$$

where $\beta_{ci}$ is a $p \times 1$ vector of coefficients specific to center $c$ and individual $i$, $\epsilon_{cij}$ is a
residual, and $\sigma^2$ is the residual variance. The question is then how to borrow information
across subjects from the different centers. If one assumes a parametric model, then it is
natural to use a multilevel structure that decomposes $\beta_{ci}$ as $\beta_{ci} = \alpha_c + \psi_i$, with the
center-specific effect $\alpha_c$ modeled as multivariate Gaussian centered on $\alpha$ and the individual-specific
deviation $\psi_i$ modeled as multivariate Gaussian centered on zero.

As a more flexible semiparametric approach, we could let

$$\beta_{ci} \sim P_c, \qquad \{P_c\} \sim \Pi,$$

where $P_c$ is a distribution characterizing variability among subjects in center $c$, and $\Pi$ is
a joint prior for the different distributions. Potentially, either the HDP or NDP could be
used for $\Pi$, but a simple alternative is to let

$$P_c = \pi G_0 + (1 - \pi) G_c, \qquad G_c \sim \text{DP}(\alpha G_0), \qquad \pi \sim \text{Beta}(a, b), \qquad (23.19)$$


This electronic edition is for non-commercial purposes only. 




so that the distribution in group $c$ is formulated as a mixture of a global distribution $G_0$
(with probability weight $\pi$) and a group-specific distribution $G_c$. The higher the probability
weight $\pi$ on $G_0$, the more similar the distributions across the groups. While having a similar
motivation, this prior differs from the HDP in including separate sets of global and
group-specific atoms. Subjects allocated to the global atoms within $G_0$ can be clustered with
subjects in other groups, while subjects allocated to the center-specific atoms will only be
clustered with other subjects in the same center. Posterior computation is straightforward
using a data augmentation approach, which introduces a binary indicator $Z_{ci} = 1$
denoting allocation to the global component. After imputing these indicators from their full
conditional, one has conditionally independent DP priors and algorithms for computation
in DPs can be used directly. Marginally, $P_c$ does not have a DP prior, and hence $\Pi$ is not
a DDP.
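The role of the allocation indicator is easiest to see generatively. The sketch below draws subject-level values from a convex mixture of a global and a center-specific discrete measure. It is an illustration under assumptions: both measures are generated as truncated stick-breaking draws with a standard normal base (so that global and center-specific atoms are distinct with probability one), scalar atoms stand in for coefficient vectors, and all names are hypothetical.

```python
import random

def truncated_dp(conc, n_atoms, rng):
    """Truncated stick-breaking draw with a N(0, 1) base: returns (atoms, weights)."""
    atoms, weights, remaining = [], [], 1.0
    for h in range(n_atoms):
        v = 1.0 if h == n_atoms - 1 else rng.betavariate(1.0, conc)
        atoms.append(rng.gauss(0.0, 1.0))
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return atoms, weights

def draw_subject(pi, global_dp, center_dp, rng):
    """Draw (Z, value) for one subject from pi * G_0 + (1 - pi) * G_c."""
    z = 1 if rng.random() < pi else 0        # Z = 1: allocated to the global component
    atoms, weights = global_dp if z == 1 else center_dp
    return z, rng.choices(atoms, weights=weights)[0]
```

Conditionally on the indicators, subjects with Z = 1 from all centers form one sample from the global measure and subjects with Z = 0 form separate samples from their center-specific measures, which is what allows standard DP algorithms to be reused.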

Expression (23.19) is reminiscent of a Gaussian hierarchical model with an overall mean 
and center-specific deviations. However, since we are modeling random probability measures 
instead of real-valued random variables, it is natural to use convex combinations instead 
of an additive structure. By using convex combinations of component random probability 
measures, we ensure that the result is a probability measure. Related convex combinations 
can be used broadly to incorporate more structured dependence in collections of measures. 
For example, recall the comet assay application discussed earlier in this chapter. In that
case, the dose groups have a natural ordering and it makes sense to choose a prior that
favors greater dependence between $P_t$ and $P_{t+1}$ than between $P_t$ and $P_{t+2}$. This can be
accomplished by defining a first-order autoregressive model,


$$P_t = (1 - \pi) P_{t-1} + \pi G_t, \qquad P_0 = G_0, \qquad G_t \sim \text{DP}(\alpha P_0), \qquad (23.20)$$


so that the RPM for dose group $t$ is equal to a mixture of the random probability measure
in the previous dose group and a dose group-specific deviation. This is similar in motivation
to a Gaussian random walk, but is instead a random walk in the space of measures, with
the parameter $\pi$ controlling the level of dependence. Potentially, one can add the additional
flexibility of letting $\pi$ depend on $t$. This type of dynamic mixture of DPs model is also useful
in time series applications, but one potential disadvantage is that atoms can only
be introduced as time goes on and never entirely disappear, though atoms introduced early
on are assigned decreasing weight. One way to circumvent this problem is to draw $P_0$ from
an HDP, so that the same atoms are used repeatedly over time. Such a structure has been
successfully applied to analyze music data in the literature.
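The geometric decay of the weight on early components follows from unrolling the recursion: the measure at step t places weight $(1-\pi)^t$ on the initial component and weight $\pi(1-\pi)^{t-s}$ on the component introduced at step s. A minimal sketch (the function name is hypothetical, and a constant mixing weight is assumed):

```python
def component_weights(t, pi):
    """Weights on the components introduced at steps 0, 1, ..., t in the measure at step t.

    Unrolls P_t = (1 - pi) * P_{t-1} + pi * G_t with P_0 = G_0 and constant pi.
    """
    weights = [1.0]                                   # step 0: all mass on the initial component
    for _ in range(t):
        weights = [(1.0 - pi) * w for w in weights] + [pi]
    return weights
```

For instance, with `pi = 0.5` and `t = 3` the weights are 0.125, 0.125, 0.25, and 0.5: every earlier component keeps positive weight, so its atoms never disappear, but that weight shrinks by a factor of $1 - \pi$ at each step.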


23.6 Density regression 


The previous section focused on the case in which we have a finite collection of random 
probability measures corresponding to different groups that either follow a simple ordering 
or are exchangeable. In many applications, the setting is not so simple and it is more 
natural to consider uncountable collections,

$$P_{\mathcal{X}} = \{P_x : x \in \mathcal{X}\}, \qquad \mathcal{X} \subset \mathbb{R}^p,$$

where $x = (x_1, \ldots, x_p)$ is a vector of predictors, $P_x$ is the random probability measure
specific to predictor value $x$, $\mathcal{X}$ denotes the domain of the predictors, and $P_{\mathcal{X}}$ is a collection
of random probability measures defined at every predictor value. One motivation arises
in the setting of density regression. In Chapter 21 we discussed density regression using
Gaussian processes; next we discuss alternatives based on Dirichlet processes.

We would ideally allow the entire conditional density of the response given predictors,
$p(y \mid x)$, to change flexibly as $x$ changes. One approach that has been widely used is the
hierarchical mixture-of-experts model, which lets

$$p(y \mid x) = \sum_{h=1}^{H} \pi_h(x)\, N(y \mid x\beta_h, \tau_h^{-1}). \qquad (23.21)$$

This corresponds to expressing the conditional density as a mixture of normal linear
regressions with weights on the different mixture components varying flexibly with predictors.
In the machine learning literature, a common approach for finite $H$ is to rely on a
probabilistic tree model for $\pi_h(x)$, though one can also use a simpler approach such as a
logistic regression.
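To make (23.21) concrete, the sketch below evaluates the conditional density for a finite number of experts, using the simpler logistic-regression option for the weights (a multinomial logit, or softmax, in the predictors). The function names and the coefficient vectors `gammas` of the weight model are assumptions introduced for the example.

```python
import math

def softmax_weights(x, gammas):
    """Expert weights at predictor vector x from a multinomial logistic regression."""
    scores = [sum(g * xi for g, xi in zip(gamma, x)) for gamma in gammas]
    top = max(scores)                                  # subtract the max for stability
    expd = [math.exp(s - top) for s in scores]
    total = sum(expd)
    return [e / total for e in expd]

def conditional_density(y, x, gammas, betas, taus):
    """p(y | x) as a weighted sum of normal linear regressions with precisions taus."""
    density = 0.0
    for w, beta, tau in zip(softmax_weights(x, gammas), betas, taus):
        mean = sum(b * xi for b, xi in zip(beta, x))
        density += w * math.sqrt(tau / (2.0 * math.pi)) * math.exp(-0.5 * tau * (y - mean) ** 2)
    return density
```

Here `x` would typically include a leading 1 for the intercept. Because the weights depend on `x`, both the location and the shape of the conditional density (skewness, number of modes) can change across the predictor space.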


Dependent stick-breaking processes 


As a nonparametric Bayesian density regression model relying on mixtures, we could use

$$p(y \mid x) = \int N(y \mid x\beta, \tau^{-1})\, dP_x(\beta, \tau), \qquad P_{\mathcal{X}} \sim \Pi_{\mathcal{X}},$$

which is a predictor-dependent mixture of linear regressions as in (23.21), but in a more
general form. Here, $\Pi_{\mathcal{X}}$ denotes the prior on the uncountable collection of mixing measures
$\{P_x : x \in \mathcal{X}\}$. It is useful to center the prior on a reasonable parametric model for the
data to favor collapsing of the posterior close to the truth when the parametric model is
approximately correct.

The main issue is how to choose $\Pi_{\mathcal{X}}$. In this regard, it is natural to think of
predictor-dependent stick-breaking processes that let

$$P_x = \sum_{h=1}^{\infty} \pi_h(x)\, \delta_{\theta_h(x)}, \qquad \pi_h(x) = V_h(x) \prod_{l<h} \{1 - V_l(x)\},$$

where $\pi_h(x)$ is the weight on component $h$ specific to predictor value $x$, $V_h(x)$ is the proportion
of the probability stick broken off at step $h$, and $\theta_h(x)$ is a predictor-dependent atom.
Here, we will focus for simplicity (computationally and conceptually) on the case in which
$\theta_h(x) = \theta_h = (\beta_h, \tau_h)$, so that there is a single global collection of regression coefficient
vectors and precisions. We then have

$$p(y \mid x) = \sum_{h=1}^{\infty} \pi_h(x)\, N(y \mid x\beta_h, \tau_h^{-1}), \qquad \pi_h(x) = V_h(x) \prod_{l<h} \{1 - V_l(x)\},$$

providing a generalization of (23.21) to include infinitely many experts. We would like to be
able to use a small number of experts for most of our data, which can be favored by choosing
a prior under which $\pi_h(x)$ decreases rapidly in the index $h$. If $V_h \sim Q$ is generated iid from
a stochastic process with $V_h(x) \sim \text{Beta}(1, \alpha)$ marginally for all $x$, then we would obtain a
DDP mixture of linear regressions. Various choices of $Q$ have been proposed, including an
order-based DDP and a local DP.

However, there are some distinct advantages computationally and in terms of practical
performance in not restricting attention to DDPs. One prior that has had good practical
performance in a variety of applications and has been shown to lead to large support and
posterior consistency in estimating conditional densities is the kernel stick-breaking process,

$$V_h(x) = K_{\psi_h}(x, \Gamma_h)\, V_h, \qquad \psi_h \sim H, \qquad \Gamma_h \sim G, \qquad V_h \sim \text{Beta}(1, \alpha),$$

where $K_\psi(\cdot, \Gamma)$ is a kernel bounded above by one located at $\Gamma$ with bandwidth $\psi$. The kernel
$K_{\psi_h}(x, \Gamma_h)$ is chosen to obtain its maximum value of 1 for $x = \Gamma_h$, in which case $V_h(x) = V_h$.
As $x$ moves away from $\Gamma_h$, the kernel decreases, leading to a corresponding decrease in $V_h(x)$.
One can view this process as generating a random location $\Gamma_h$ with a corresponding
stick-breaking random variable $V_h$ and atom $(\beta_h, \tau_h)$. Due to the incorporation of the kernel,
$\pi_h(x)$ will tend to be larger when $x$ is located close to $\Gamma_h$, particularly when the index $h$ is
small. The spatial variability in the weights across the predictor space is controlled by the
kernel bandwidths and, by allowing kernel-specific bandwidths, we allow more rapid changes
in certain regions. As the kernels become flat, so that $K_{\psi_h}(x, \Gamma_h) \approx 1$ for all $x$ and $h$, we
obtain a DP stick-breaking process as a limiting case.
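The kernel stick-breaking weights at a given predictor value can be computed directly from the definition. The sketch below assumes a scalar predictor, a truncation to finitely many components, and a squared-exponential kernel in which the bandwidth parameter acts as an inverse squared length scale (so that a value of zero gives a flat kernel); these choices and the function names are illustrative assumptions.

```python
import math

def gaussian_kernel(x, location, bandwidth):
    """Kernel bounded above by one, equal to 1 at x == location (illustrative choice)."""
    return math.exp(-bandwidth * (x - location) ** 2)

def ksbp_weights(x, locations, bandwidths, sticks):
    """Weights at predictor value x: each stick is shrunk by its kernel before breaking."""
    weights, remaining = [], 1.0
    for loc, bw, v in zip(locations, bandwidths, sticks):
        v_x = gaussian_kernel(x, loc, bw) * v     # predictor-dependent stick proportion
        weights.append(remaining * v_x)
        remaining *= 1.0 - v_x
    return weights
```

With all bandwidths equal to zero the weights no longer depend on `x` and reduce to ordinary stick-breaking weights. Under truncation the weights sum to less than one; the leftover mass belongs to the components beyond the truncation level.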

Posterior computation for kernel stick-breaking process mixtures tends to be straightforward
and efficient, for example by prespecifying a grid of potential values for the bandwidths
and locations to facilitate Gibbs sampling. However, there are some computational
advantages to an alternative probit stick-breaking specification, which lets

$$\pi_h(x) = V_h(x) \prod_{l<h} \{1 - V_l(x)\}, \qquad V_h(x) = \Phi(\alpha_h + u_h(x)), \qquad \alpha_h \sim N(\mu, 1),$$

where $\Phi(z)$ is the standard normal cdf and $u_h : \mathcal{X} \to \mathbb{R}$ is an arbitrary regression model. To
motivate the probit stick-breaking process (PSBP), initially consider the baseline case in
which there are no predictors so that $V_h(x) = V_h = \Phi(\alpha_h)$, with $\alpha_h \sim N(\mu, 1)$. This model
is similar to the Dirichlet process, but instead of generating the $V_h$s iid from $\text{Beta}(1, \alpha)$ we
obtain the $V_h$s by transforming iid $N(\mu, 1)$ draws to the unit interval via a standard normal
cdf (one can alternatively use a logistic transformation and obtain a logistic stick-breaking
process). In the special case in which $\mu = 0$, we obtain $V_h \sim \text{Beta}(1, 1)$ and hence the PSBP
with $\mu = 0$ and the DP with precision 1 are equivalent. In general, the hyperparameter $\mu$ plays
a similar role to the precision $\alpha$ in the DP in controlling the rate of decrease in the
stick-breaking random variables and the associated prior on the number of clusters in the sample.
For large $\mu$, we obtain $V_h \approx 1$ and hence large weight on the first component, similarly to
$\alpha \approx 0$ in the DP.
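The probit transformation is simple to simulate, which gives a quick check of the two limiting statements above. This is a minimal sketch with hypothetical function names; the optional argument `u` stands in for the values of the regression terms at one fixed predictor value and defaults to zero (the no-predictor case).

```python
import math
import random

def std_normal_cdf(z):
    """Standard normal cdf computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def psbp_weights(mu, n_components, rng, u=lambda h: 0.0):
    """Truncated probit stick-breaking weights at one predictor value."""
    weights, remaining = [], 1.0
    for h in range(n_components):
        v = std_normal_cdf(rng.gauss(mu, 1.0) + u(h))   # stick proportion in (0, 1)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights
```

With `mu = 0` the stick proportions are uniform on the unit interval, so their sample mean is near 0.5; with a large value such as `mu = 3` nearly all the mass falls on the first component.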


Example. Glucose tolerance prediction 

We apply the probit stick-breaking process to an epidemiological study of diabetes.
The focus was on assessing the relationship between $y_i$ = glucose tolerance (GT), $x_{i1}$
= log-transformed insulin sensitivity (IS), and other diabetes risk factors: $x_{i2}$ = age, $x_{i3}$
= waist to hip ratio (WTH), $x_{i4}$ = body mass index (BMI), $x_{i5}$ = diastolic blood
pressure (DBP), and $x_{i6}$ = systolic blood pressure (SBP) in $n = 868$ patients. GT is
measured by 2-hour plasma glucose level (mg/dl) in the oral glucose tolerance test and
indicates how fast glucose is cleared from the blood. GT is also used to diagnose type
2 diabetes using < 140 (normal), [140, 200] (prediabetes), and > 200 (diabetes). IS
provides an indicator of how well the body responds to insulin, a hormone regulating
movement of glucose from the blood to body cells.

Figure 23.7 plots 2-hour plasma glucose level against IS, age, waist-to-hip ratio (WTH), 
body mass index (BMI), diastolic blood pressure (DBP), and systolic blood pressure 
(SBP). There is a large right skew in the glucose distribution, with the distributional 
shape changing with IS. As linear or nonlinear mean or median regression models are 
not supported for these data, we apply a Bayesian density regression approach to al- 
low the distribution of 2-hour glucose to change flexibly with the different risk factors, 
while also allowing risk factors to drop out of the model and to have effects that are 
local to particular regions of the predictor space. 

The marginal posterior probabilities of inclusion for the respective predictors were 1.0, 1.0,
0.03, 0.02, 0.03, and 0.03, implying that IS and age are important factors for the change of
glucose distribution but the other predictors can be discarded. Figure 23.8 shows the
estimated conditional density $p(y \mid x)$ with IS and age varying across their 5th, 50th,
and 95th empirical percentiles. The glucose density has a heavy right tail for low IS
but, as IS increases, the right tail disappears. The right tail characterizes the group
of people whose 2-hour glucose level is above 200 mg/dl (reference line is 0.2 with
standardization). The right tail becomes heavier as age increases, especially for those
subjects with low IS, meaning that aging is also related to poor GT.

Figure 23.7 Data from glucose-tolerance study: y = 2-hour glucose level (mg/dl); x1 = insulin
sensitivity; x2 = age; x3 = waist to hip ratio; x4 = body-mass index; x5 = diastolic blood pressure;
x6 = systolic blood pressure.


23.7 Bibliographic note 


For recent practically-motivated introductory overviews of Bayesian nonparametrics, see 
Dunson (2009, 2010b). For recent articles on asymptotic properties of Dirichlet process 
mixtures, see Shen and Ghosal (2013) and Tokdar (2011). For recent articles on Bayesian 
computation in Dirichlet process mixtures models including novel approaches for fast com- 
putation in high-dimensional settings and citations to the approaches referred to in this 
chapter, see Wang and Dunson (2011a) and Carvalho et al. (2010). For references on the
use of Dirichlet process priors for hierarchical models, see Kleinman and Ibrahim (1998) and
Ohlssen, Sharples, and Spiegelhalter (2007). Ray and Mallick (2006), Rodriguez, Dunson,
and Gelfand (2009) and Bigelow and Dunson (2009) use Dirichlet processes in developing
models for functional data. The Bayesian bootstrap was introduced by Rubin (1981b).

The chapter has focused on Dirichlet process mixture models. But other families of
distributions can also work well to model unknown densities in Bayesian hierarchical models.
Indeed, there is a rich literature proposing many alternative priors, such as Polya trees
(Lavine, 1992), mixtures of Polya trees (Hanson and Johnson, 2002), and normalized random
measures with independent increments (James, Lijoi, and Prunster, 2009).

Figure 23.8 Predictive (dashed) conditional response density p(y|x) and 95% credible intervals
(dash-dotted) with normalized x1 (insulin sensitivity) and x2 (age) varying among 5th, 50th, 95th
empirical percentiles.

The dependent Dirichlet process (DDP) was originally proposed by MacEachern (1999) 
and was subsequently used to develop Anova models of random distributions by De Iorio et 
al. (2004) and for nonparametric spatial data analysis by Gelfand, Kottas, and MacEachern 
(2005). There have been numerous applications in different settings, with De la Cruz- 
Mesia, Quintana, and Muller (2009) using DDPs for semiparametric Bayes classification 
from longitudinal predictors and Dunson and Peddada (2009) developing an alternative 
restricted DDP for modeling of stochastically ordered densities. Wang and Dunson (2011b) 
recently developed restricted DDP mixtures for modeling of densities that are stochastically 
non-decreasing in a continuous predictor. Most of the focus has been on DDPs with fixed 
probability weights on the mixture components, but Griffin and Steel (2006) propose an 
order-based DDP to allow varying weights, and Chung and Dunson (2011) propose a local 
DP that allows varying weights in a related but simpler manner. Hierarchical DPs (Teh et 
al., 2006) and nested DPs (Rodriguez, Dunson, and Gelfand, 2009) are applications of the 
DDP framework to hierarchical dependence structures. 

There is a rich literature on alternatives to DDPs for formulating dependence in ran- 
dom probability measures. One key advance was the approach of Muller, Quintana, and 
Rosner (2004), which formulated dependence in group-specific random probability mea- 
sures by using convex combinations of a global random probability measure (RPM) with 
group-specific RPMs. This convex combinations strategy was extended to accommodate 
continuously varying collections of RPMs by Dunson, Pillai, and Park (2007) using
kernel-weighted convex combinations of DPs. An alternative strategy has relied on generalized
stick-breaking processes, which replace the beta random variables in the DP stick-breaking
representation with more complex forms. For example, Dunson and Park (2009) propose a
kernel stick-breaking process, while Reich and Fuentes (2007) apply a related approach to 
modeling of hurricane winds. Chung and Dunson (2009) and Rodriguez and Dunson (2011) 
proposed an alternative probit stick-breaking process, while Ren et al. (2011) instead use a 
logistic stick-breaking process motivated by imaging applications. Dependent stick-breaking 
processes for time series of random distributions have been considered by Dunson (2006), 
Rodriguez and ter Horst (2008), and Griffin and Steel (2011), among others. 

There is also a literature on alternatives to stick-breaking processes for characterizing 
dependence in RPMs, with a recent emphasis on normalized random measures with indepen- 
dent increments (James, Lijoi, and Prunster, 2009). For example, Griffin (2011) proposes a 
class of priors for time-varying random probability measures through normalizing stochastic 
processes derived from Ornstein-Uhlenbeck processes. 


23.8 Exercises 
1. The following exercise is useful to gain familiarity with posterior computation and infer- 
ences for the Dirichlet process mixture of Gaussian models: 


(a) Simulate data from the following mixture of normals: 
$p(y_i) = 0.1\, N(y_i \mid -1, 0.2) + 0.5\, N(y_i \mid 0, 1) + 0.4\, N(y_i \mid 1, 0.4), \quad i = 1, \ldots, 100.$


(b) Use the density() function in R to obtain a non-Bayesian estimate of the density 
and plot this estimate versus the true density. 

(c) Apply the finite mixture model Gibbs sampler described in Chapter 22 for k = 20, 
a a=1,u9 =0,andk=a,=b,=a=1. 


=) 
(d) Run the blocked Gibbs sampler for N = 20 and the same hyperparameter specification. 


(e) Compare the resulting density estimates. 
For sufficiently many MCMC iterations and sufficiently large truncation levels $k$, $N$, the
density estimates obtained via the finite Dirichlet approximation and truncated
stick-breaking approximations to the DPM of Gaussians should be similar.


2. To get an intuition for the impact of $\alpha$ and $P_0$, repeat the previous exercise but with:
(a) A higher value of $\alpha$, such as $\alpha = 10$
(b) A gamma hyperprior distribution on $\alpha$, with $a_\alpha = b_\alpha = 0.1$
(c) Much higher variance in the normal-gamma $P_0$.

Appendix A


Standard probability distributions 


Tables A.1 and A.2 present notation, probability density functions, parameter descriptions, 
means, modes, and standard deviations for several standard probability distributions. We 
use the standard notation $\theta$ for the random variable (or random vector), except in the case
of the Wishart and inverse-Wishart, for which we use $W$ for the random matrix, and LKJ
correlation, for which we use $\Sigma$ for a correlation matrix.

Realistic distributions for complicated multivariate models, including hierarchical and 
mixture models, can typically be constructed using, as building blocks, the simple distribu- 
tions listed here. In our own work we use preprogrammed random number routines (many 
available in R, for example), but it can be valuable to understand where these numbers 
come from. 

The starting point for any simulation is a function that draws pseudorandom samples
from the uniform distribution on the unit interval. Much research has been done to ensure
that the pseudorandom numbers are appropriate for realistic applied tasks. For example, a 
sequence may appear uniform in one dimension while m-tuples are not randomly scattered 
in m dimensions. 


A.1 Continuous distributions 
Uniform 


The uniform distribution is used to represent a variable that is known to lie in an interval
and equally likely to be found anywhere in the interval. A noninformative distribution
is obtained in the limit as $a \to -\infty$, $b \to \infty$. If $u$ is drawn from a standard uniform
distribution $U(0, 1)$, then $\theta = a + (b - a)u$ is a draw from $U(a, b)$.


Univariate normal 


The normal, or Gaussian, distribution is ubiquitous in statistics. Sample averages are
approximately normally distributed by the central limit theorem. A noninformative or
flat distribution is obtained in the limit as the variance $\sigma^2 \to \infty$. The variance is usually
restricted to be positive; $\sigma^2 = 0$ corresponds to a point mass at $\mu$. The density function is
always finite; the integral is finite as long as $\sigma^2$ is finite. If $z$ is a random deviate from the
standard normal distribution, then $\theta = \mu + \sigma z$ is a draw from $\mathrm{N}(\mu, \sigma^2)$.

Two properties of the normal distribution that play a large role in model building and 
Bayesian computation are the addition and mixture properties. 

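The sum of two independent normal random variables is normally distributed. If $\theta_1$
and $\theta_2$ are independent with $\mathrm{N}(\mu_1, \sigma_1^2)$ and $\mathrm{N}(\mu_2, \sigma_2^2)$ distributions, then
$\theta_1 + \theta_2 \sim \mathrm{N}(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$. The mixture property states that if
$\theta_1|\theta_2 \sim \mathrm{N}(\theta_2, \sigma_1^2)$ and $\theta_2 \sim \mathrm{N}(\mu_2, \sigma_2^2)$, then
$\theta_1 \sim \mathrm{N}(\mu_2, \sigma_1^2 + \sigma_2^2)$. This is useful in the analysis of hierarchical normal models.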
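The mixture property can be checked by simulation. The following is a minimal sketch in Python using only the standard library; the function names and the particular values of $\mu_2$, $\sigma_1$, $\sigma_2$ are illustrative assumptions, not part of the text. It draws $\theta_2$, then $\theta_1$ given $\theta_2$, and returns the sample mean and variance of $\theta_1$, which should be close to $\mu_2$ and $\sigma_1^2 + \sigma_2^2$.

```python
import random
import statistics


def draw_normal(mu, sigma, rng):
    """theta = mu + sigma * z, with z a standard normal deviate."""
    z = rng.gauss(0.0, 1.0)
    return mu + sigma * z


def draw_mixture(mu2, sigma1, sigma2, rng):
    """Draw theta2 ~ N(mu2, sigma2^2), then theta1 | theta2 ~ N(theta2, sigma1^2)."""
    theta2 = draw_normal(mu2, sigma2, rng)
    return draw_normal(theta2, sigma1, rng)


def check_mixture(mu2=1.0, sigma1=2.0, sigma2=1.5, n=50_000, seed=0):
    """Return the sample mean and sample variance of n marginal draws of theta1."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    draws = [draw_mixture(mu2, sigma1, sigma2, rng) for _ in range(n)]
    return statistics.fmean(draws), statistics.variance(draws)
```

With the default values, the returned variance should be near $2^2 + 1.5^2 = 6.25$, up to Monte Carlo error.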
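

Table A.1 Continuous distributions

For each distribution, the entries give the notation, the parameters, the density function,
and the mean, variance, and mode.

Uniform
    Notation: $\theta \sim \mathrm{U}(\alpha, \beta)$; $p(\theta) = \mathrm{U}(\theta|\alpha, \beta)$
    Parameters: boundaries $\alpha, \beta$ with $\beta > \alpha$
    Density: $p(\theta) = \frac{1}{\beta - \alpha}$, $\theta \in [\alpha, \beta]$
    Mean, variance, mode: $\mathrm{E}(\theta) = \frac{\alpha + \beta}{2}$; $\mathrm{var}(\theta) = \frac{(\beta - \alpha)^2}{12}$; no mode

Normal
    Notation: $\theta \sim \mathrm{N}(\mu, \sigma^2)$; $p(\theta) = \mathrm{N}(\theta|\mu, \sigma^2)$
    Parameters: location $\mu$; scale $\sigma > 0$
    Density: $p(\theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2}(\theta - \mu)^2\right)$
    Mean, variance, mode: $\mathrm{E}(\theta) = \mu$; $\mathrm{var}(\theta) = \sigma^2$; $\mathrm{mode}(\theta) = \mu$

Lognormal
    Notation: $\theta \sim \mathrm{lognormal}(\mu, \sigma^2)$; $p(\theta) = \mathrm{lognormal}(\theta|\mu, \sigma^2)$
    Parameters: location $\mu$; log-scale $\sigma > 0$
    Density: $p(\theta) = (\sqrt{2\pi}\,\sigma\theta)^{-1} \exp\left(-\frac{1}{2\sigma^2}(\log\theta - \mu)^2\right)$
    Mean, variance, mode: $\mathrm{E}(\theta) = \exp(\mu + \frac{1}{2}\sigma^2)$; $\mathrm{var}(\theta) = \exp(2\mu + \sigma^2)(\exp(\sigma^2) - 1)$; $\mathrm{mode}(\theta) = \exp(\mu - \sigma^2)$

Multivariate normal
    Notation: $\theta \sim \mathrm{N}(\mu, \Sigma)$; $p(\theta) = \mathrm{N}(\theta|\mu, \Sigma)$ (implicit dimension $d$)
    Parameters: location vector $\mu$; symmetric, pos. definite, $d \times d$ variance matrix $\Sigma$
    Density: $p(\theta) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(\theta - \mu)^T \Sigma^{-1} (\theta - \mu)\right)$
    Mean, variance, mode: $\mathrm{E}(\theta) = \mu$; $\mathrm{var}(\theta) = \Sigma$; $\mathrm{mode}(\theta) = \mu$

Gamma
    Notation: $\theta \sim \mathrm{Gamma}(\alpha, \beta)$; $p(\theta) = \mathrm{Gamma}(\theta|\alpha, \beta)$
    Parameters: shape $\alpha > 0$; inverse scale $\beta > 0$
    Density: $p(\theta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \theta^{\alpha - 1} e^{-\beta\theta}$, $\theta > 0$
    Mean, variance, mode: $\mathrm{E}(\theta) = \frac{\alpha}{\beta}$; $\mathrm{var}(\theta) = \frac{\alpha}{\beta^2}$; $\mathrm{mode}(\theta) = \frac{\alpha - 1}{\beta}$, for $\alpha \geq 1$

Inverse-gamma
    Notation: $\theta \sim \mathrm{Inv\text{-}gamma}(\alpha, \beta)$; $p(\theta) = \mathrm{Inv\text{-}gamma}(\theta|\alpha, \beta)$
    Parameters: shape $\alpha > 0$; scale $\beta > 0$
    Density: $p(\theta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \theta^{-(\alpha + 1)} e^{-\beta/\theta}$, $\theta > 0$
    Mean, variance, mode: $\mathrm{E}(\theta) = \frac{\beta}{\alpha - 1}$, for $\alpha > 1$; $\mathrm{var}(\theta) = \frac{\beta^2}{(\alpha - 1)^2(\alpha - 2)}$, for $\alpha > 2$; $\mathrm{mode}(\theta) = \frac{\beta}{\alpha + 1}$

Chi-square
    Notation: $\theta \sim \chi^2_\nu$; $p(\theta) = \chi^2_\nu(\theta)$
    Parameters: degrees of freedom $\nu > 0$
    Density: $p(\theta) = \frac{2^{-\nu/2}}{\Gamma(\nu/2)} \theta^{\nu/2 - 1} e^{-\theta/2}$, $\theta > 0$; same as $\mathrm{Gamma}(\alpha = \frac{\nu}{2}, \beta = \frac{1}{2})$
    Mean, variance, mode: $\mathrm{E}(\theta) = \nu$; $\mathrm{var}(\theta) = 2\nu$; $\mathrm{mode}(\theta) = \nu - 2$, for $\nu \geq 2$

Inverse-chi-square
    Notation: $\theta \sim \mathrm{Inv\text{-}}\chi^2_\nu$; $p(\theta) = \mathrm{Inv\text{-}}\chi^2_\nu(\theta)$
    Parameters: degrees of freedom $\nu > 0$
    Density: $p(\theta) = \frac{2^{-\nu/2}}{\Gamma(\nu/2)} \theta^{-(\nu/2 + 1)} e^{-1/(2\theta)}$, $\theta > 0$; same as $\mathrm{Inv\text{-}gamma}(\alpha = \frac{\nu}{2}, \beta = \frac{1}{2})$
    Mean, variance, mode: $\mathrm{E}(\theta) = \frac{1}{\nu - 2}$, for $\nu > 2$; $\mathrm{var}(\theta) = \frac{2}{(\nu - 2)^2(\nu - 4)}$, for $\nu > 4$; $\mathrm{mode}(\theta) = \frac{1}{\nu + 2}$

Scaled inverse-chi-square
    Notation: $\theta \sim \mathrm{Inv\text{-}}\chi^2(\nu, s^2)$; $p(\theta) = \mathrm{Inv\text{-}}\chi^2(\theta|\nu, s^2)$
    Parameters: degrees of freedom $\nu > 0$; scale $s > 0$
    Density: $p(\theta) = \frac{(\nu/2)^{\nu/2}}{\Gamma(\nu/2)} s^\nu \theta^{-(\nu/2 + 1)} e^{-\nu s^2/(2\theta)}$, $\theta > 0$; same as $\mathrm{Inv\text{-}gamma}(\alpha = \frac{\nu}{2}, \beta = \frac{\nu}{2}s^2)$
    Mean, variance, mode: $\mathrm{E}(\theta) = \frac{\nu}{\nu - 2}s^2$, for $\nu > 2$; $\mathrm{var}(\theta) = \frac{2\nu^2}{(\nu - 2)^2(\nu - 4)}s^4$, for $\nu > 4$; $\mathrm{mode}(\theta) = \frac{\nu}{\nu + 2}s^2$

Exponential
    Notation: $\theta \sim \mathrm{Expon}(\beta)$; $p(\theta) = \mathrm{Expon}(\theta|\beta)$
    Parameters: inverse scale $\beta > 0$
    Density: $p(\theta) = \beta e^{-\beta\theta}$, $\theta > 0$; same as $\mathrm{Gamma}(\alpha = 1, \beta)$
    Mean, variance, mode: $\mathrm{E}(\theta) = \frac{1}{\beta}$; $\mathrm{var}(\theta) = \frac{1}{\beta^2}$; $\mathrm{mode}(\theta) = 0$

Laplace (double-exponential)
    Notation: $\theta \sim \mathrm{Laplace}(\mu, \sigma)$; $p(\theta) = \mathrm{Laplace}(\theta|\mu, \sigma)$
    Parameters: location $\mu$; scale $\sigma > 0$
    Density: $p(\theta) = \frac{1}{2\sigma} \exp\left(-\frac{|\theta - \mu|}{\sigma}\right)$
    Mean, variance, mode: $\mathrm{E}(\theta) = \mu$; $\mathrm{var}(\theta) = 2\sigma^2$; $\mathrm{mode}(\theta) = \mu$

Weibull
    Notation: $\theta \sim \mathrm{Weibull}(\alpha, \beta)$; $p(\theta) = \mathrm{Weibull}(\theta|\alpha, \beta)$
    Parameters: shape $\alpha > 0$; scale $\beta > 0$
    Density: $p(\theta) = \frac{\alpha}{\beta^\alpha} \theta^{\alpha - 1} \exp(-(\theta/\beta)^\alpha)$, $\theta > 0$
    Mean, variance, mode: $\mathrm{E}(\theta) = \beta\,\Gamma(1 + \frac{1}{\alpha})$; $\mathrm{var}(\theta) = \beta^2[\Gamma(1 + \frac{2}{\alpha}) - (\Gamma(1 + \frac{1}{\alpha}))^2]$; $\mathrm{mode}(\theta) = \beta(1 - \frac{1}{\alpha})^{1/\alpha}$

Wishart
    Notation: $W \sim \mathrm{Wishart}_\nu(S)$; $p(W) = \mathrm{Wishart}_\nu(W|S)$ (implicit dimension $k \times k$)
    Parameters: degrees of freedom $\nu$; symmetric, pos. definite $k \times k$ scale matrix $S$
    Density: $p(W) = \left(2^{\nu k/2} \pi^{k(k-1)/4} \prod_{i=1}^k \Gamma\left(\frac{\nu + 1 - i}{2}\right)\right)^{-1} |S|^{-\nu/2} |W|^{(\nu - k - 1)/2} \exp\left(-\frac{1}{2}\mathrm{tr}(S^{-1}W)\right)$, $W$ pos. definite
    Mean: $\mathrm{E}(W) = \nu S$

Inverse-Wishart
    Notation: $W \sim \mathrm{Inv\text{-}Wishart}_\nu(S^{-1})$; $p(W) = \mathrm{Inv\text{-}Wishart}_\nu(W|S^{-1})$ (implicit dimension $k \times k$)
    Parameters: degrees of freedom $\nu$; symmetric, pos. definite $k \times k$ scale matrix $S$
    Density: $p(W) = \left(2^{\nu k/2} \pi^{k(k-1)/4} \prod_{i=1}^k \Gamma\left(\frac{\nu + 1 - i}{2}\right)\right)^{-1} |S|^{\nu/2} |W|^{-(\nu + k + 1)/2} \exp\left(-\frac{1}{2}\mathrm{tr}(SW^{-1})\right)$, $W$ pos. definite
    Mean: $\mathrm{E}(W) = (\nu - k - 1)^{-1} S$

LKJ correlation
    Notation: $\Sigma \sim \mathrm{LkjCorr}(\eta)$; $p(\Sigma) = \mathrm{LkjCorr}(\Sigma|\eta)$ (implicit dimension $k \times k$)
    Parameters: shape $\eta > 0$
    Density: $p(\Sigma) = \det(\Sigma)^{\eta - 1} / c_k(\eta)$, with normalizing constant $c_k(\eta) = 2^{\sum_{i=1}^{k-1} (2\eta - 2 + k - i)(k - i)} \prod_{i=1}^{k-1} \left(\mathrm{B}\left(\eta + \frac{k - i - 1}{2}, \eta + \frac{k - i - 1}{2}\right)\right)^{k - i}$
    Mean: $\mathrm{E}(\Sigma) = I_k$
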
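
Table A.1 Continuous distributions, continued

t
    Notation: $\theta \sim t_\nu(\mu, \sigma^2)$; $p(\theta) = t_\nu(\theta|\mu, \sigma^2)$; $t_\nu$ is short for $t_\nu(0, 1)$
    Parameters: degrees of freedom $\nu > 0$; location $\mu$; scale $\sigma > 0$
    Density: $p(\theta) = \frac{\Gamma((\nu + 1)/2)}{\Gamma(\nu/2)\sqrt{\nu\pi}\,\sigma} \left(1 + \frac{1}{\nu}\left(\frac{\theta - \mu}{\sigma}\right)^2\right)^{-(\nu + 1)/2}$
    Mean, variance, mode: $\mathrm{E}(\theta) = \mu$, for $\nu > 1$; $\mathrm{var}(\theta) = \frac{\nu}{\nu - 2}\sigma^2$, for $\nu > 2$; $\mathrm{mode}(\theta) = \mu$

Multivariate t
    Notation: $\theta \sim t_\nu(\mu, \Sigma)$; $p(\theta) = t_\nu(\theta|\mu, \Sigma)$ (implicit dimension $d$)
    Parameters: degrees of freedom $\nu > 0$; location $\mu = (\mu_1, \ldots, \mu_d)$; symmetric, pos. definite $d \times d$ scale matrix $\Sigma$
    Density: $p(\theta) = \frac{\Gamma((\nu + d)/2)}{\Gamma(\nu/2)\,\nu^{d/2}\pi^{d/2}} |\Sigma|^{-1/2} \left(1 + \frac{1}{\nu}(\theta - \mu)^T \Sigma^{-1} (\theta - \mu)\right)^{-(\nu + d)/2}$
    Mean, variance, mode: $\mathrm{E}(\theta) = \mu$, for $\nu > 1$; $\mathrm{var}(\theta) = \frac{\nu}{\nu - 2}\Sigma$, for $\nu > 2$; $\mathrm{mode}(\theta) = \mu$

Beta
    Notation: $\theta \sim \mathrm{Beta}(\alpha, \beta)$; $p(\theta) = \mathrm{Beta}(\theta|\alpha, \beta)$
    Parameters: 'prior sample sizes' $\alpha > 0$, $\beta > 0$
    Density: $p(\theta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}$, $\theta \in [0, 1]$
    Mean, variance, mode: $\mathrm{E}(\theta) = \frac{\alpha}{\alpha + \beta}$; $\mathrm{var}(\theta) = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$; $\mathrm{mode}(\theta) = \frac{\alpha - 1}{\alpha + \beta - 2}$

Dirichlet
    Notation: $\theta \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_k)$; $p(\theta) = \mathrm{Dirichlet}(\theta|\alpha_1, \ldots, \alpha_k)$
    Parameters: 'prior sample sizes' $\alpha_j > 0$; $\alpha_0 \equiv \sum_{j=1}^k \alpha_j$
    Density: $p(\theta) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_k)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_k)} \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}$; $\theta_1, \ldots, \theta_k \geq 0$; $\sum_{j=1}^k \theta_j = 1$
    Mean, variance, mode: $\mathrm{E}(\theta_j) = \frac{\alpha_j}{\alpha_0}$; $\mathrm{var}(\theta_j) = \frac{\alpha_j(\alpha_0 - \alpha_j)}{\alpha_0^2(\alpha_0 + 1)}$; $\mathrm{cov}(\theta_i, \theta_j) = -\frac{\alpha_i\alpha_j}{\alpha_0^2(\alpha_0 + 1)}$; $\mathrm{mode}(\theta_j) = \frac{\alpha_j - 1}{\alpha_0 - k}$

Logistic
    Notation: $\theta \sim \mathrm{Logistic}(\mu, \sigma)$; $p(\theta) = \mathrm{Logistic}(\theta|\mu, \sigma)$
    Parameters: location $\mu$; scale $\sigma > 0$
    Density: $p(\theta) = \frac{\exp\left(-\frac{\theta - \mu}{\sigma}\right)}{\sigma\left(1 + \exp\left(-\frac{\theta - \mu}{\sigma}\right)\right)^2}$
    Mean, variance, mode: $\mathrm{E}(\theta) = \mu$; $\mathrm{var}(\theta) = \frac{1}{3}\sigma^2\pi^2$; $\mathrm{mode}(\theta) = \mu$

Log-logistic
    Notation: $\theta \sim \mathrm{Log\text{-}logistic}(\alpha, \beta)$; $p(\theta) = \mathrm{Log\text{-}logistic}(\theta|\alpha, \beta)$
    Parameters: scale $\alpha > 0$; shape $\beta > 0$
    Density: $p(\theta) = \frac{(\beta/\alpha)(\theta/\alpha)^{\beta - 1}}{[1 + (\theta/\alpha)^\beta]^2}$, $\theta > 0$
    Mean, variance, mode: $\mathrm{E}(\theta) = \frac{\alpha\pi/\beta}{\sin(\pi/\beta)}$, for $\beta > 1$; $\mathrm{var}(\theta) = \alpha^2\left[\frac{2\pi/\beta}{\sin(2\pi/\beta)} - \frac{(\pi/\beta)^2}{\sin^2(\pi/\beta)}\right]$, for $\beta > 2$; $\mathrm{mode}(\theta) = \alpha\left(\frac{\beta - 1}{\beta + 1}\right)^{1/\beta}$, for $\beta > 1$


Table A.2 Discrete distributions

Poisson
    Notation: $\theta \sim \mathrm{Poisson}(\lambda)$; $p(\theta) = \mathrm{Poisson}(\theta|\lambda)$
    Parameters: 'rate' $\lambda > 0$
    Density: $p(\theta) = \frac{1}{\theta!}\lambda^\theta \exp(-\lambda)$, $\theta = 0, 1, 2, \ldots$
    Mean, variance, mode: $\mathrm{E}(\theta) = \lambda$; $\mathrm{var}(\theta) = \lambda$; $\mathrm{mode}(\theta) = \lfloor\lambda\rfloor$

Binomial
    Notation: $\theta \sim \mathrm{Bin}(n, p)$; $p(\theta) = \mathrm{Bin}(\theta|n, p)$
    Parameters: 'sample size' $n$ (positive integer); 'probability' $p \in [0, 1]$
    Density: $p(\theta) = \binom{n}{\theta} p^\theta (1 - p)^{n - \theta}$, $\theta = 0, 1, 2, \ldots, n$
    Mean, variance, mode: $\mathrm{E}(\theta) = np$; $\mathrm{var}(\theta) = np(1 - p)$; $\mathrm{mode}(\theta) = \lfloor(n + 1)p\rfloor$

Multinomial
    Notation: $\theta \sim \mathrm{Multin}(n; p_1, \ldots, p_k)$; $p(\theta) = \mathrm{Multin}(\theta|n; p_1, \ldots, p_k)$
    Parameters: 'sample size' $n$ (positive integer); 'probabilities' $p_j \in [0, 1]$; $\sum_{j=1}^k p_j = 1$
    Density: $p(\theta) = \binom{n}{\theta_1\,\theta_2 \cdots \theta_k} p_1^{\theta_1} \cdots p_k^{\theta_k}$; $\theta_j = 0, 1, 2, \ldots, n$; $\sum_{j=1}^k \theta_j = n$
    Mean, variance: $\mathrm{E}(\theta_j) = np_j$; $\mathrm{var}(\theta_j) = np_j(1 - p_j)$; $\mathrm{cov}(\theta_i, \theta_j) = -np_ip_j$

Negative binomial
    Notation: $\theta \sim \mathrm{Neg\text{-}bin}(\alpha, \beta)$; $p(\theta) = \mathrm{Neg\text{-}bin}(\theta|\alpha, \beta)$
    Parameters: shape $\alpha > 0$; inverse scale $\beta > 0$
    Density: $p(\theta) = \binom{\theta + \alpha - 1}{\alpha - 1} \left(\frac{\beta}{\beta + 1}\right)^\alpha \left(\frac{1}{\beta + 1}\right)^\theta$, $\theta = 0, 1, 2, \ldots$
    Mean, variance: $\mathrm{E}(\theta) = \frac{\alpha}{\beta}$; $\mathrm{var}(\theta) = \frac{\alpha}{\beta^2}(\beta + 1)$

Beta-binomial
    Notation: $\theta \sim \mathrm{Beta\text{-}bin}(n, \alpha, \beta)$; $p(\theta) = \mathrm{Beta\text{-}bin}(\theta|n, \alpha, \beta)$
    Parameters: 'sample size' $n$ (positive integer); 'prior sample sizes' $\alpha > 0$, $\beta > 0$
    Density: $p(\theta) = \frac{\Gamma(n + 1)}{\Gamma(\theta + 1)\Gamma(n - \theta + 1)} \frac{\Gamma(\alpha + \theta)\Gamma(n + \beta - \theta)}{\Gamma(\alpha + \beta + n)} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}$, $\theta = 0, 1, 2, \ldots, n$
    Mean, variance: $\mathrm{E}(\theta) = n\frac{\alpha}{\alpha + \beta}$; $\mathrm{var}(\theta) = n\frac{\alpha\beta(\alpha + \beta + n)}{(\alpha + \beta)^2(\alpha + \beta + 1)}$

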
Lognormal


If $\theta$ is a random variable that is restricted to be positive, and $\log\theta \sim \mathrm{N}(\mu, \sigma^2)$, then $\theta$ is said
to have a lognormal distribution. Using the Jacobian of the log transformation, one can
directly determine that the density is $p(\theta) = (\sqrt{2\pi}\,\sigma\theta)^{-1} \exp(-\frac{1}{2\sigma^2}(\log\theta - \mu)^2)$, the mean
is $\exp(\mu + \frac{1}{2}\sigma^2)$, the variance is $\exp(2\mu)\exp(\sigma^2)(\exp(\sigma^2) - 1)$, and the mode is $\exp(\mu - \sigma^2)$.
The geometric mean and geometric standard deviation of a lognormally distributed random
variable $\theta$ are simply $e^\mu$ and $e^\sigma$.


Multivariate normal 


The multivariate normal density is always finite; the integral is finite as long as
$\det(\Sigma^{-1}) > 0$. A noninformative distribution is obtained in the limit as $\det(\Sigma^{-1}) \to 0$; this limit
is not uniquely defined. A random draw from a multivariate normal distribution can be
obtained using the Cholesky decomposition of $\Sigma$ and a vector of univariate normal draws.
The Cholesky decomposition of $\Sigma$ produces a lower-triangular matrix $A$ (the 'Cholesky
factor') for which $AA^T = \Sigma$. If $z = (z_1, \ldots, z_d)$ are $d$ independent standard normal random
variables, then $\theta = \mu + Az$ is a random draw from the multivariate normal distribution with
covariance matrix $\Sigma$.
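The Cholesky construction can be sketched in a few lines. The code below is an illustrative Python version using only the standard library (in practice one would call a preprogrammed routine); the function names `cholesky` and `draw_mvn` are assumptions made for this sketch. It computes the lower-triangular factor $A$ directly and forms $\theta = \mu + Az$.

```python
import math
import random


def cholesky(sigma):
    """Lower-triangular A with A A^T = sigma, for symmetric positive definite sigma."""
    d = len(sigma)
    a = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            partial = sum(a[i][k] * a[j][k] for k in range(j))
            if i == j:
                a[i][j] = math.sqrt(sigma[i][i] - partial)
            else:
                a[i][j] = (sigma[i][j] - partial) / a[j][j]
    return a


def draw_mvn(mu, chol, rng):
    """theta = mu + A z, with z a vector of independent standard normal draws."""
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [mu[i] + sum(chol[i][k] * z[k] for k in range(i + 1))
            for i in range(len(mu))]
```

The factor is passed to `draw_mvn` rather than recomputed, so repeated draws from the same distribution reuse one decomposition.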

The marginal distribution of any subset of components (for example, $\theta_i$ or $(\theta_i, \theta_j)$) is also
normal. Any linear transformation of $\theta$, such as the projection of $\theta$ onto a linear subspace,
is also normal, with dimension equal to the rank of the transformation. The conditional
distribution of $\theta$, constrained to lie on any linear subspace, is also normal. The addition
property holds: if $\theta_1$ and $\theta_2$ are independent with $\mathrm{N}(\mu_1, \Sigma_1)$ and $\mathrm{N}(\mu_2, \Sigma_2)$ distributions,
then $\theta_1 + \theta_2 \sim \mathrm{N}(\mu_1 + \mu_2, \Sigma_1 + \Sigma_2)$ as long as $\theta_1$ and $\theta_2$ have the same dimension. We
discuss the generalization of the mixture property shortly.

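The conditional distribution of any subvector of $\theta$ given the remaining elements is once
again multivariate normal. If we partition $\theta$ into subvectors $\theta = (U, V)$, then $p(U|V)$ is
(multivariate) normal:

$$
\begin{aligned}
\mathrm{E}(U|V) &= \mathrm{E}(U) + \mathrm{cov}(U, V)\,\mathrm{var}(V)^{-1}(V - \mathrm{E}(V)), \\
\mathrm{var}(U|V) &= \mathrm{var}(U) - \mathrm{cov}(U, V)\,\mathrm{var}(V)^{-1}\mathrm{cov}(V, U),
\end{aligned}
\tag{A.1}
$$

where $\mathrm{cov}(V, U)$ is a rectangular matrix (submatrix of $\Sigma$) of the appropriate dimensions, and
$\mathrm{cov}(U, V) = \mathrm{cov}(V, U)^T$. In particular, if we define the matrix of conditional coefficients,

$$
C = I - [\mathrm{diag}(\Sigma^{-1})]^{-1}\Sigma^{-1},
$$

then

$$
(\theta_i \mid \theta_j, \text{ all } j \neq i) \sim \mathrm{N}\Big(\mu_i + \sum_{j \neq i} c_{ij}(\theta_j - \mu_j),\; [(\Sigma^{-1})_{ii}]^{-1}\Big).
\tag{A.2}
$$

Conversely, if we parameterize the distribution of $U$ and $V$ hierarchically:

$$
U|V \sim \mathrm{N}(XV, \Sigma_{U|V}), \qquad V \sim \mathrm{N}(\mu_V, \Sigma_V),
$$

then the joint distribution of $\theta$ is the multivariate normal,

$$
\theta = \begin{pmatrix} U \\ V \end{pmatrix} \sim \mathrm{N}\left(
\begin{pmatrix} X\mu_V \\ \mu_V \end{pmatrix},
\begin{pmatrix} X\Sigma_V X^T + \Sigma_{U|V} & X\Sigma_V \\ \Sigma_V X^T & \Sigma_V \end{pmatrix}
\right).
$$

This generalizes the mixture property of univariate normals.

The 'weighted sum of squares,' $SS = (\theta - \mu)^T \Sigma^{-1} (\theta - \mu)$, has a $\chi^2_d$ distribution. For
any matrix $A$ for which $AA^T = \Sigma$, the conditional distribution of $A^{-1}(\theta - \mu)$, given $SS$, is
uniform on a $(d-1)$-dimensional sphere.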
Gamma


The gamma distribution is the conjugate prior distribution for the inverse of the normal
variance and for the mean parameter of the Poisson distribution. The gamma integral is
finite if $\alpha > 0$; the density function is finite if $\alpha \geq 1$. Many computer packages generate
gamma random variables directly; otherwise, it is possible to obtain draws from a gamma
random variable using draws from a uniform as input. The most effective method depends
on the parameter $\alpha$; see the references for details.

There is an addition property for independent gamma random variables with the same
inverse scale parameter. If $\theta_1$ and $\theta_2$ are independent with $\mathrm{Gamma}(\alpha_1, \beta)$ and $\mathrm{Gamma}(\alpha_2, \beta)$
distributions, then $\theta_1 + \theta_2 \sim \mathrm{Gamma}(\alpha_1 + \alpha_2, \beta)$. The logarithm of a gamma random
variable is approximately normal; raising a gamma random variable to the one-third power
provides an even better normal approximation.


Inverse-gamma 


If $\theta^{-1}$ has a gamma distribution with parameters $\alpha, \beta$, then $\theta$ has the inverse-gamma
distribution. The density is always finite; its integral is finite if $\alpha > 0$. The inverse-gamma is
the conjugate prior distribution for the normal variance. A noninformative distribution is
obtained as $\alpha, \beta \to 0$.


Chi-square 

The $\chi^2$ distribution is a special case of the gamma distribution, with $\alpha = \nu/2$ and $\beta = \frac{1}{2}$.
The addition property holds since the inverse scale parameter is fixed: if $\theta_1$ and $\theta_2$ are
independent with $\chi^2_{\nu_1}$ and $\chi^2_{\nu_2}$ distributions, then $\theta_1 + \theta_2 \sim \chi^2_{\nu_1 + \nu_2}$.


Inverse chi-square 


The inverse-$\chi^2$ is a special case of the inverse-gamma distribution, with $\alpha = \nu/2$ and $\beta = \frac{1}{2}$.
We also define the scaled inverse chi-square distribution, which is useful for variance
parameters in normal models. To obtain a simulation draw $\theta$ from the $\mathrm{Inv\text{-}}\chi^2(\nu, s^2)$ distribution,
first draw $X$ from the $\chi^2_\nu$ distribution and then let $\theta = \nu s^2/X$.
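As a sketch of this two-step draw, the Python code below (standard library only; the function names are assumptions for illustration) obtains the $\chi^2_\nu$ variate from the gamma generator, using the fact that $\chi^2_\nu$ is a gamma distribution with shape $\nu/2$ and inverse scale $\frac{1}{2}$. Note that `random.gammavariate` is parameterized by a scale, not an inverse scale, so the second argument is 2.

```python
import random


def draw_chi2(nu, rng):
    """chi^2_nu as Gamma(shape nu/2, inverse scale 1/2), i.e. scale 2."""
    return rng.gammavariate(nu / 2.0, 2.0)


def draw_scaled_inv_chi2(nu, s2, rng):
    """Draw X ~ chi^2_nu and return theta = nu * s^2 / X."""
    x = draw_chi2(nu, rng)
    return nu * s2 / x
```

For $\nu > 2$, the average of many such draws should approach $\frac{\nu}{\nu - 2}s^2$.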


Exponential 


The exponential distribution is the distribution of waiting times for the next event in a
Poisson process and is a special case of the gamma distribution with $\alpha = 1$. Simulation of
draws from the exponential distribution is straightforward. If $U$ is a draw from the uniform
distribution on $[0, 1]$, then $-\log(U)/\beta$ is a draw from the exponential distribution with
parameter $\beta$.
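A minimal Python sketch of this inversion follows; the function name is an assumption for illustration. One practical detail: `random.random()` returns values in $[0, 1)$, so the sketch uses $1 - U$, which has the same uniform distribution but cannot be exactly zero, to avoid taking the logarithm of zero.

```python
import math
import random


def draw_exponential(beta, rng):
    """Return -log(U) / beta for U uniform on (0, 1]; beta is the inverse scale."""
    u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is always defined
    return -math.log(u) / beta
```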


Weibull 


If $\theta$ is a random variable that is restricted to be positive, and $(\theta/\beta)^\alpha$ has an $\mathrm{Expon}(1)$
distribution, then $\theta$ is said to have a Weibull distribution with shape parameter $\alpha > 0$
and scale parameter $\beta > 0$. The Weibull is often used to model failure times in reliability
analysis. Using the Jacobian of this power transformation, one can directly determine that the
density is $p(\theta) = \frac{\alpha}{\beta^\alpha}\theta^{\alpha - 1}\exp(-(\theta/\beta)^\alpha)$, the mean is $\beta\,\Gamma(1 + \frac{1}{\alpha})$, the variance is
$\beta^2[\Gamma(1 + \frac{2}{\alpha}) - (\Gamma(1 + \frac{1}{\alpha}))^2]$, and the mode is $\beta(1 - \frac{1}{\alpha})^{1/\alpha}$.
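The defining relation also gives a direct way to simulate: if $E \sim \mathrm{Expon}(1)$, then $\theta = \beta E^{1/\alpha}$ has the Weibull distribution. The following Python sketch (standard library only; the function name is an assumption) uses that relation rather than any built-in Weibull generator.

```python
import math
import random


def draw_weibull(alpha, beta, rng):
    """Invert (theta / beta)^alpha = E with E ~ Expon(1): theta = beta * E^(1/alpha)."""
    e = -math.log(1.0 - rng.random())  # Expon(1) draw; 1 - U avoids log(0)
    return beta * e ** (1.0 / alpha)
```

The sample mean of many draws can be compared with $\beta\,\Gamma(1 + \frac{1}{\alpha})$, which `math.gamma` evaluates directly.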
Wishart


The Wishart is the conjugate prior distribution for the inverse covariance matrix in a
multivariate normal distribution. It is a multivariate generalization of the gamma distribution.
The integral is finite if the degrees of freedom parameter, $\nu$, is greater than or equal to
the dimension, $k$. The density is finite if $\nu \geq k + 1$. A noninformative distribution is
obtained as $\nu \to 0$. The sample covariance matrix for independent and identically distributed
multivariate normal data has a Wishart distribution. In fact, multivariate normal
simulations can be used to simulate a draw from the Wishart distribution, as follows. Simulate
$\alpha_1, \ldots, \alpha_\nu$, $\nu$ independent samples from a $k$-dimensional multivariate $\mathrm{N}(0, S)$ distribution,
then let $W = \sum_{i=1}^\nu \alpha_i\alpha_i^T$. This only works when the distribution is proper; that is, $\nu \geq k$.


Inverse- Wishart 


If $W^{-1} \sim \mathrm{Wishart}_\nu(S)$ then $W$ has the inverse-Wishart distribution. The inverse-Wishart is
the conjugate prior distribution for the multivariate normal covariance matrix. The
inverse-Wishart density is always finite, and the integral is always finite. A degenerate form occurs
when $\nu < k$.


LKJ correlation 


The LKJ distribution (Lewandowski, Kurowicka, and Joe, 2009) is a distribution over
positive-definite symmetric matrices with unit diagonals; that is, correlation matrices. If $\Sigma$
is a correlation matrix, $\mathrm{LkjCorr}(\Sigma|\eta) \propto \det(\Sigma)^{\eta - 1}$, with the parameter $\eta$ required to be
positive. The shape parameter $\eta$ can be interpreted like the shape parameter of a symmetric
beta distribution. If $\eta = 1$, then the density is uniform over all correlation matrices of a
given order. If $\eta > 1$, the modal correlation matrix is the identity, with the distribution
being more concentrated about this mode as $\eta$ becomes large. For $0 < \eta < 1$, the density
has a trough at the identity matrix.


t 


The $t$ (or Student-$t$) is the marginal posterior for the normal mean with unknown variance
and conjugate prior and can be interpreted as a mixture of normals with common mean
and variances that follow an inverse-gamma distribution. The $t$ is also the ratio of a normal
random variable and the square root of an independent gamma random variable. To
simulate $t$, simulate $z$ from a standard normal and $x$ from a $\chi^2_\nu$, then let $\theta = \mu + \sigma z\sqrt{\nu/x}$. The
$t$ density is always finite; the integral is finite if $\nu > 0$ and $\sigma$ is finite. In the limit $\nu \to \infty$,
the $t$ distribution approaches $\mathrm{N}(\mu, \sigma^2)$. The case of $\nu = 1$ is called the Cauchy distribution.
The $t$ distribution can be used in place of a normal in a robust analysis.

To draw from the multivariate $t_\nu(\mu, \Sigma)$ distribution, generate a vector $z \sim \mathrm{N}(0, I)$ and
a scalar $x \sim \chi^2_\nu$, then compute $\mu + Az\sqrt{\nu/x}$, where $A$ satisfies $AA^T = \Sigma$.
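The univariate recipe translates directly into code. This is a hedged Python sketch using only the standard library; `draw_t` is a name chosen for illustration, and the $\chi^2_\nu$ variate is obtained from `random.gammavariate` with shape $\nu/2$ and scale 2.

```python
import math
import random


def draw_t(nu, mu, sigma, rng):
    """theta = mu + sigma * z * sqrt(nu / x), with z ~ N(0, 1) and x ~ chi^2_nu."""
    z = rng.gauss(0.0, 1.0)
    x = rng.gammavariate(nu / 2.0, 2.0)  # chi^2_nu as Gamma(shape nu/2, scale 2)
    return mu + sigma * z * math.sqrt(nu / x)
```

For $\nu > 2$ the sample variance of many draws should be near $\frac{\nu}{\nu - 2}\sigma^2$; for small $\nu$ the heavy tails make that check noisy.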


Beta 


The beta is the conjugate prior distribution for the binomial probability. The density is
finite if $\alpha, \beta \geq 1$, and the integral is finite if $\alpha, \beta > 0$. The choice $\alpha = \beta = 1$ gives the
standard uniform distribution; $\alpha = \beta = 0.5$ and $\alpha = \beta = 0$ are also sometimes used as
noninformative densities. To simulate $\theta$ from the beta distribution, first simulate $x_\alpha$ and
$x_\beta$ from $\chi^2_{2\alpha}$ and $\chi^2_{2\beta}$ distributions, respectively, then let $\theta = \frac{x_\alpha}{x_\alpha + x_\beta}$.
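The ratio construction can be sketched as follows in Python (standard library only; `draw_beta` is an illustrative name). A $\chi^2$ variate with $2\alpha$ degrees of freedom is a gamma variate with shape $\alpha$ and scale 2, which is what `random.gammavariate(alpha, 2.0)` returns; the standard library also offers `random.betavariate` for direct use.

```python
import random


def draw_beta(alpha, beta, rng):
    """theta = x_a / (x_a + x_b), with x_a ~ chi^2_{2 alpha} and x_b ~ chi^2_{2 beta}."""
    x_a = rng.gammavariate(alpha, 2.0)  # chi^2 with 2*alpha degrees of freedom
    x_b = rng.gammavariate(beta, 2.0)   # chi^2 with 2*beta degrees of freedom
    return x_a / (x_a + x_b)
```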

It is sometimes useful to estimate quickly the parameters of the beta distribution using
the method of moments:

$$
\alpha + \beta = \frac{\mathrm{E}(\theta)(1 - \mathrm{E}(\theta))}{\mathrm{var}(\theta)} - 1,
\qquad
\alpha = (\alpha + \beta)\,\mathrm{E}(\theta), \qquad \beta = (\alpha + \beta)(1 - \mathrm{E}(\theta)).
\tag{A.3}
$$
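As a small worked sketch of (A.3), the Python function below (the name and the input validation are assumptions added for illustration) converts a mean and variance into $\alpha$ and $\beta$. The check on the variance reflects that a beta distribution with mean $m$ must have variance below $m(1 - m)$.

```python
def beta_method_of_moments(mean, var):
    """Return (alpha, beta) matching the given mean and variance, following (A.3)."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("variance must be positive and below mean * (1 - mean)")
    total = mean * (1.0 - mean) / var - 1.0  # this is alpha + beta
    return total * mean, total * (1.0 - mean)
```

For example, feeding in the mean and variance of a Beta(2, 5) distribution recovers $\alpha = 2$ and $\beta = 5$ up to rounding error.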


The $k$th order statistic from a sample of $n$ independent $\mathrm{U}(0, 1)$ variables has the
$\mathrm{Beta}(k, n - k + 1)$ distribution.


Dirichlet 


The Dirichlet is the conjugate prior distribution for the parameters of the multinomial
distribution. The Dirichlet is a multivariate generalization of the beta distribution. As
with the beta, the integral is finite if all of the $\alpha$'s are positive, and the density is finite if
all are greater than or equal to one.

The marginal distribution of a single $\theta_j$ is $\mathrm{Beta}(\alpha_j, \alpha_0 - \alpha_j)$. The marginal distribution
of a subvector of $\theta$ is Dirichlet; for example,
$(\theta_i, \theta_j, 1 - \theta_i - \theta_j) \sim \mathrm{Dirichlet}(\alpha_i, \alpha_j, \alpha_0 - \alpha_i - \alpha_j)$.
The conditional distribution of a subvector given the remaining elements is Dirichlet under
the condition $\sum_{j=1}^k \theta_j = 1$.

There are two standard approaches to sampling from a Dirichlet distribution. The fastest
method generalizes the method used to sample from the beta distribution: draw $x_1, \ldots, x_k$
from independent gamma distributions with common scale and shape parameters $\alpha_1, \ldots, \alpha_k$,
and for each $j$, let $\theta_j = x_j / \sum_{i=1}^k x_i$. A less efficient algorithm relies on the univariate
marginal and conditional distributions being beta and proceeds as follows. Simulate $\theta_1$
from a $\mathrm{Beta}(\alpha_1, \sum_{i=2}^k \alpha_i)$ distribution. Then simulate $\theta_2, \ldots, \theta_{k-1}$ in order, as follows.
For $j = 2, \ldots, k - 1$, simulate $\phi_j$ from a $\mathrm{Beta}(\alpha_j, \sum_{i=j+1}^k \alpha_i)$ distribution, and let
$\theta_j = (1 - \sum_{i=1}^{j-1} \theta_i)\phi_j$. Finally, set $\theta_k = 1 - \sum_{i=1}^{k-1} \theta_i$.
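Both sampling schemes can be sketched compactly. The Python code below uses only the standard library; the function names are assumptions for illustration. The first function normalizes independent gamma draws with a common scale of 1; the second builds the vector one component at a time from beta draws, tracking the probability mass not yet allocated.

```python
import random


def draw_dirichlet_gamma(alphas, rng):
    """Normalize independent Gamma(alpha_j, scale 1) draws so that they sum to one."""
    x = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(x)
    return [xi / total for xi in x]


def draw_dirichlet_sequential(alphas, rng):
    """Sequential beta draws: theta_j = (1 - sum of earlier thetas) * phi_j."""
    k = len(alphas)
    theta = []
    remaining = 1.0  # mass not yet allocated, 1 - sum of earlier thetas
    for j in range(k - 1):
        phi = rng.betavariate(alphas[j], sum(alphas[j + 1:]))
        theta.append(remaining * phi)
        remaining -= theta[-1]
    theta.append(remaining)  # last component takes whatever is left
    return theta
```

Either function returns a vector whose components have means $\alpha_j/\alpha_0$, which gives a simple check that the two schemes agree.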


Constrained distributions 


We sometimes use notation such as $\mathrm{N}^+$ to convey the normal distribution constrained to be
positive; that is, the truncated normal distribution. We also have occasion to use the half-$t$
distribution, which is the right half of the $t$ distribution.


A.2 Discrete distributions 
Poisson 


The Poisson distribution is commonly used to represent count data, such as the number
of arrivals in a fixed time period. The Poisson distribution has an addition property:
if $\theta_1$ and $\theta_2$ are independent with $\mathrm{Poisson}(\lambda_1)$ and $\mathrm{Poisson}(\lambda_2)$ distributions, then
$\theta_1 + \theta_2 \sim \mathrm{Poisson}(\lambda_1 + \lambda_2)$. Simulation for the Poisson distribution (and most discrete
distributions) can be cumbersome. Table lookup can be used to invert the cumulative
distribution function. Simulation texts describe other approaches.
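Inverting the cumulative distribution function can be sketched without a stored table by accumulating probabilities until they exceed a uniform draw. The Python code below is an illustrative version for moderate $\lambda$ only (for very large $\lambda$, $e^{-\lambda}$ underflows and other methods are needed); the function name is an assumption.

```python
import math
import random


def draw_poisson(lam, rng):
    """Return the smallest theta whose cumulative probability is at least u."""
    u = rng.random()
    theta = 0
    p = math.exp(-lam)  # Pr(theta = 0)
    cdf = p
    # The p > 0 guard stops the loop if rounding keeps cdf just below u.
    while u > cdf and p > 0.0:
        theta += 1
        p *= lam / theta  # Pr(theta) = Pr(theta - 1) * lam / theta
        cdf += p
    return theta
```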


Binomial 


The binomial distribution is commonly used to represent the number of 'successes' in a
sequence of $n$ independent and identically distributed Bernoulli trials, with probability of
success $p$ in each trial. A binomial random variable with large $n$ is approximately normal.
If $\theta_1$ and $\theta_2$ are independent with $\mathrm{Bin}(n_1, p)$ and $\mathrm{Bin}(n_2, p)$ distributions, then
$\theta_1 + \theta_2 \sim \mathrm{Bin}(n_1 + n_2, p)$. For small $n$, a binomial random variable can be simulated by obtaining $n$
independent standard uniforms and setting $\theta$ equal to the number of uniform deviates less
than or equal to $p$. For larger $n$, more efficient algorithms are often available in computer
packages. When $n = 1$, the binomial is called the Bernoulli distribution.
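The small-$n$ method amounts to counting uniform draws. A minimal Python sketch follows (the function name is an assumption). It uses a strict inequality because `random.random()` can return exactly 0, which would otherwise count as a success even when $p = 0$; for continuous uniforms the two comparisons are equivalent.

```python
import random


def draw_binomial(n, p, rng):
    """Count how many of n standard uniform draws fall below p."""
    return sum(1 for _ in range(n) if rng.random() < p)
```

The cost grows linearly in $n$, which is why this approach is suggested only for small $n$.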


Multinomial 


The multinomial distribution is a multivariate generalization of the binomial distribution.
The marginal distribution of a single $\theta_j$ is binomial. The conditional distribution of a
subvector of $\theta$ is multinomial with 'sample size' parameter reduced by the fixed components
of $\theta$ and 'probability' parameters rescaled to have sum equal to one. We can simulate a
multinomial draw using a sequence of binomial draws. Draw $\theta_1$ from a $\mathrm{Bin}(n, p_1)$
distribution. Then draw $\theta_2, \ldots, \theta_{k-1}$ in order, as follows. For $j = 2, \ldots, k - 1$, draw $\theta_j$ from a
$\mathrm{Bin}(n - \sum_{i=1}^{j-1} \theta_i,\; p_j / \sum_{i=j}^k p_i)$ distribution. Finally, set $\theta_k = n - \sum_{i=1}^{k-1} \theta_i$. If at any time
in the simulation the binomial sample size parameter equals zero, use the convention that
a $\mathrm{Bin}(0, p)$ variable is identically zero.
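The sequential scheme can be sketched as below in Python (standard library only; the function names are illustrative assumptions, and the binomial draws use the simple uniform-counting method, which suits small $n$). The conditional probability is capped at 1 to guard against rounding error in the running sum of probabilities, a detail added here rather than taken from the text.

```python
import random


def draw_binomial(n, p, rng):
    """Count uniform draws below p; returns 0 when n is 0."""
    return sum(1 for _ in range(n) if rng.random() < p)


def draw_multinomial(n, probs, rng):
    """Draw each component from a binomial on the remaining trials and probability."""
    theta = []
    remaining_n = n           # trials not yet assigned to a category
    remaining_p = sum(probs)  # probability mass of categories j, ..., k
    for p in probs[:-1]:
        cond_p = min(1.0, p / remaining_p) if remaining_p > 0.0 else 0.0
        count = draw_binomial(remaining_n, cond_p, rng)
        theta.append(count)
        remaining_n -= count
        remaining_p -= p
    theta.append(remaining_n)  # last category receives the leftover trials
    return theta
```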


Negative binomial 


The negative binomial distribution is the marginal distribution for a Poisson random
variable when the rate parameter has a $\mathrm{Gamma}(\alpha, \beta)$ prior distribution. The negative binomial
can also be used as a robust alternative to the Poisson distribution, because it has the same
sample space, but has an additional parameter. To simulate a negative binomial random
variable, draw $\lambda \sim \mathrm{Gamma}(\alpha, \beta)$ and then draw $\theta \sim \mathrm{Poisson}(\lambda)$. In the limit $\alpha \to \infty$,
and $\alpha/\beta \to$ constant, the distribution approaches a Poisson with parameter $\alpha/\beta$. Under
the alternative parameterization, $p = \frac{\beta}{\beta + 1}$, the random variable $\theta$ can be interpreted as
the number of Bernoulli failures obtained before the $\alpha$ successes, where the probability of
success is $p$.
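The gamma-Poisson mixture gives a two-line sampler once a Poisson generator is available. The Python sketch below (standard library only; function names are assumptions) includes a simple inversion-based Poisson generator suitable for moderate rates. Since `random.gammavariate` takes a scale, the inverse scale $\beta$ enters as $1/\beta$.

```python
import math
import random


def draw_poisson(lam, rng):
    """Poisson draw by accumulating probabilities until they exceed a uniform."""
    u = rng.random()
    theta, p = 0, math.exp(-lam)
    cdf = p
    while u > cdf and p > 0.0:
        theta += 1
        p *= lam / theta
        cdf += p
    return theta


def draw_neg_bin(alpha, beta, rng):
    """Draw lambda ~ Gamma(alpha, inverse scale beta), then theta ~ Poisson(lambda)."""
    lam = rng.gammavariate(alpha, 1.0 / beta)  # second argument is a scale
    return draw_poisson(lam, rng)
```

The sample mean and variance of many draws can be compared with $\alpha/\beta$ and $\frac{\alpha}{\beta^2}(\beta + 1)$.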


Beta-binomial 


The beta-binomial arises as the marginal distribution of a binomial random variable when
the probability of success has a $\mathrm{Beta}(\alpha, \beta)$ prior distribution. It can also be used as a
robust alternative to the binomial distribution. The mixture definition gives an algorithm
for simulating from the beta-binomial: draw $\phi \sim \mathrm{Beta}(\alpha, \beta)$ and then draw $\theta \sim \mathrm{Bin}(n, \phi)$.
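A short Python sketch of this mixture algorithm follows, using `random.betavariate` from the standard library for the first step and uniform counting for the binomial step (appropriate for small $n$). The function name is an assumption for illustration.

```python
import random


def draw_beta_binomial(n, alpha, beta, rng):
    """Draw phi ~ Beta(alpha, beta), then count n Bernoulli(phi) successes."""
    phi = rng.betavariate(alpha, beta)
    return sum(1 for _ in range(n) if rng.random() < phi)
```

Compared with a binomial of the same mean, the draws are overdispersed, which is what makes the distribution useful as a robust alternative.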


A.3 Bibliographic note 


Many software packages contain subroutines to simulate draws from these distributions. 
Texts on simulation typically include information about many of these distributions; for 
example, Gentle (2003) discusses simulation of all of these in detail, except for the LKJ 
distribution. Ripley (1987) is another helpful general book on simulation. Johnson and 
Kotz (1972) give more detail, such as the characteristic functions, for the distributions. 
Fortran and C programs for uniform, normal, gamma, Poisson, and binomial distributions 
are available in Press et al. (1986). 
Appendix B


Outline of proofs of limit theorems


The basic result of large-sample Bayesian inference is that as more and more data arrive, 
the posterior distribution of the parameter vector approaches multivariate normal. If the 
likelihood model happens to be correct, then we can also prove that the limiting posterior 
distribution is centered at the true value of the parameter vector. In this appendix, we 
outline a proof of the main results. The practical relevance of the theorems is discussed in 
Chapter 4. 

We derive the limiting posterior distribution in three steps. The first step is the conver- 
gence of the posterior distribution to a point, for a discrete parameter space. If the data 
truly come from the hypothesized family of probability models, the point of convergence will 
be the true value of the parameter. The second step applies the discrete result to regions 
in continuous parameter space, to show that the mass of the continuous posterior distri- 
bution becomes concentrated in smaller and smaller neighborhoods of a particular value 
of parameter space. Finally, the third step of the proof shows the accuracy of the normal 
approximation in the vicinity of the posterior mode. 


Mathematical framework 


The key assumption for the results presented here is that data are independent and
identically distributed: we label the data as $y = (y_1, \ldots, y_n)$, with probability density $\prod_{i=1}^n f(y_i)$.
We use the notation $f(\cdot)$ for the true distribution of the data, in contrast to $p(\cdot|\theta)$, the
distribution of our probability model. The data $y$ may be discrete or continuous.

We are interested in a (possibly vector) parameter $\theta$, defined on a space $\Theta$, for which
we have a prior distribution, $p(\theta)$, and a likelihood, $p(y|\theta) = \prod_{i=1}^n p(y_i|\theta)$, which assumes
the data are independent and identically distributed. As illustrated in the counterexamples
discussed in Section 4.3, some conditions are required on the prior distribution and the
likelihood, as well as on the space $\Theta$, for the theorems to hold.

It is necessary to assume a true distribution for $y$, because the theorems only hold in
probability; for almost every problem, it is possible to construct data sequences $y$ for which
the posterior distribution of $\theta$ will not have the desired limit. The theorems are of the form,
'The posterior distribution of $\theta$ converges in probability (as $n \to \infty$) to...'; the 'probability'
is with respect to $f(y)$, the true distribution of $y$.

We label $\theta_0$ as the value of $\theta$ that minimizes the Kullback-Leibler divergence $\mathrm{KL}(\theta)$ of the
distribution $p(\cdot|\theta)$ in the model relative to the true distribution, $f(\cdot)$. The Kullback-Leibler
divergence is defined at any value $\theta$ by

$$
\mathrm{KL}(\theta) = \mathrm{E}\left(\log\left(\frac{f(y_i)}{p(y_i|\theta)}\right)\right)
= \int \log\left(\frac{f(y_i)}{p(y_i|\theta)}\right) f(y_i)\,dy_i.
\tag{B.1}
$$

This is a measure of 'discrepancy' between the model distribution $p(y_i|\theta)$ and the true
distribution $f(y)$, and $\theta_0$ may be thought of as the value of $\theta$ that minimizes this discrepancy.
We assume that $\theta_0$ is the unique minimizer of $\mathrm{KL}(\theta)$. It turns out that as $n$ increases, the
posterior distribution $p(\theta|y)$ becomes concentrated about $\theta_0$.

Suppose that the likelihood model is correct; that is, there is some true parameter value
$\theta$ for which $f(y_i) = p(y_i|\theta)$. In this case, it is easily shown via Jensen's inequality that
(B.1) is minimized at the true parameter value, which we can then label as $\theta_0$ without risk
of confusion.


Convergence of the posterior distribution for a discrete parameter space 


Theorem. If the parameter space $\Theta$ is finite and $\Pr(\theta = \theta_0) > 0$, then $\Pr(\theta = \theta_0|y) \to 1$ as
$n \to \infty$, where $\theta_0$ is the value of $\theta$ that minimizes the Kullback-Leibler divergence (B.1).


Proof. We will show that $p(\theta|y) \to 0$ as $n \to \infty$ for all $\theta \neq \theta_0$. Consider the log posterior
odds relative to $\theta_0$:

$$
\log\left(\frac{p(\theta|y)}{p(\theta_0|y)}\right)
= \log\left(\frac{p(\theta)}{p(\theta_0)}\right)
+ \sum_{i=1}^n \log\left(\frac{p(y_i|\theta)}{p(y_i|\theta_0)}\right).
\tag{B.2}
$$

The second term on the right is a sum of $n$ independent identically distributed random
variables, if $\theta$ and $\theta_0$ are considered fixed and the $y_i$'s are random with distributions $f$.
Each term in the summation has a mean of

$$
\mathrm{E}\left(\log\left(\frac{p(y_i|\theta)}{p(y_i|\theta_0)}\right)\right) = \mathrm{KL}(\theta_0) - \mathrm{KL}(\theta),
$$

which is zero if $\theta = \theta_0$ and negative otherwise, as long as $\theta_0$ is the unique minimizer of
$\mathrm{KL}(\theta)$.

Thus, if $\theta \neq \theta_0$, the second term on the right of (B.2) is the sum of $n$ independent
identically distributed random variables with negative mean. By the law of large numbers,
the sum approaches $-\infty$ as $n \to \infty$. As long as the first term on the right of (B.2) is finite
(that is, as long as $p(\theta_0) > 0$), the whole expression approaches $-\infty$ in the limit. Then,
$p(\theta|y)/p(\theta_0|y) \to 0$, and so $p(\theta|y) \to 0$. Since all probabilities sum to 1, $p(\theta_0|y) \to 1$.
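The discrete-space result can be illustrated numerically. The following Python sketch is an assumed toy setup, not an example from the text: the data are Bernoulli with true success probability 0.3, the model is Bernoulli($\theta$) with $\theta$ restricted to a finite grid that does not contain 0.3, and the prior is uniform on the grid. The posterior mass should pile up on the grid point with the smallest Kullback-Leibler divergence.

```python
import math
import random


def kl_bernoulli(q, theta):
    """KL divergence of the Bernoulli(theta) model relative to the true Bernoulli(q)."""
    return q * math.log(q / theta) + (1 - q) * math.log((1 - q) / (1 - theta))


def posterior_on_grid(y, grid, prior):
    """Posterior probabilities on a finite grid, computed on the log scale."""
    s, n = sum(y), len(y)
    log_post = [math.log(pr) + s * math.log(t) + (n - s) * math.log(1 - t)
                for t, pr in zip(grid, prior)]
    top = max(log_post)  # subtract the maximum before exponentiating
    w = [math.exp(lp - top) for lp in log_post]
    total = sum(w)
    return [wi / total for wi in w]


def simulate(q=0.3, grid=(0.1, 0.25, 0.5, 0.75), n=2000, seed=1):
    """Simulate n Bernoulli(q) observations and return the posterior on the grid."""
    rng = random.Random(seed)
    y = [1 if rng.random() < q else 0 for _ in range(n)]
    prior = [1.0 / len(grid)] * len(grid)
    return posterior_on_grid(y, grid, prior)
```

On this grid the divergence is smallest at 0.25, so for large $n$ nearly all posterior probability should sit there even though the model is misspecified.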


Convergence of the posterior distribution for a continuous parameter space 


If $\theta$ has a continuous distribution, then $p(\theta_0|y)$ is always zero for any finite sample, and
so the above theorem cannot apply. We can, however, show that the posterior probability
distribution of $\theta$ becomes more and more concentrated about $\theta_0$ as $n \to \infty$. Define a
neighborhood of $\theta_0$ as the open set of all points in $\Theta$ within a fixed nonzero distance of $\theta_0$.


Theorem. If $\theta$ is defined on a compact set and $A$ is a neighborhood of $\theta_0$ with nonzero
prior probability, then $\Pr(\theta \in A|y) \to 1$ as $n \to \infty$, where $\theta_0$ is the value of $\theta$ that minimizes
(B.1).

Proof. The theorem can be proved by placing a small neighborhood about each point in
$\Theta$, with $A$ being the only neighborhood that includes $\theta_0$, and then covering $\Theta$ with a finite
subset of these neighborhoods. If $\Theta$ is compact, such a finite subcovering can always be
obtained. The proof of the convergence of the posterior distribution to a point is then
adapted to show that the posterior probability for any neighborhood except $A$ approaches
zero as $n \to \infty$, and thus $\Pr(\theta \in A|y) \to 1$.


Convergence of the posterior distribution to normality 


We just showed that by increasing $n$, we can put as much of the mass of the posterior
distribution as we like in any arbitrary neighborhood of $\theta_0$. Obtaining the limiting
posterior distribution requires two more steps. The first is to show that the posterior mode is
consistent; that is, that the mode of the posterior distribution falls within the neighborhood
where almost all the mass lies. The second step is a normal approximation centered at the
posterior mode.
Theorem. Under some regularity conditions (notably that $\theta_0$ not be on the boundary of
$\Theta$), as $n \to \infty$, the posterior distribution of $\theta$ approaches normality with mean $\theta_0$ and
variance $(nJ(\theta_0))^{-1}$, where $\theta_0$ is the value that minimizes the Kullback-Leibler divergence
(B.1) and $J$ is the Fisher information (2.20).


Proof. For convenience in exposition, we first derive the result for a scalar $\theta$. Define $\hat\theta$ as
the posterior mode. The proof of the consistency of the maximum likelihood estimate (see
the bibliographic note at the end of the chapter) can be mimicked to show that $\hat\theta$ is also
consistent; that is, $\hat\theta \to \theta_0$ as $n \to \infty$.

Given the consistency of the posterior mode, we approximate the log posterior density
by a Taylor expansion centered about $\hat\theta$, confident that (for large $n$) the neighborhood near
$\hat\theta$ has almost all the mass in the posterior distribution. The normal approximation for $\theta$ is
a quadratic approximation for the log posterior distribution of $\theta$, a form that we derive via
a Taylor series expansion of $\log p(\theta|y)$ centered at $\hat\theta$:

$$
\log p(\theta|y) = \log p(\hat\theta|y)
+ \frac{1}{2}(\theta - \hat\theta)^2 \left[\frac{d^2}{d\theta^2}\log p(\theta|y)\right]_{\theta=\hat\theta}
+ \frac{1}{6}(\theta - \hat\theta)^3 \left[\frac{d^3}{d\theta^3}\log p(\theta|y)\right]_{\theta=\hat\theta}
+ \cdots
$$

(The linear term in the expansion is zero because the log posterior density has zero derivative
at its interior mode.)

Consider the above equation as a function of $\theta$. The first term is a constant. The
coefficient for the second term can be written as

$$
\left[\frac{d^2}{d\theta^2}\log p(\theta|y)\right]_{\theta=\hat\theta}
= \left[\frac{d^2}{d\theta^2}\log p(\theta)\right]_{\theta=\hat\theta}
+ \sum_{i=1}^n \left[\frac{d^2}{d\theta^2}\log p(y_i|\theta)\right]_{\theta=\hat\theta},
$$

which is a constant plus the sum of $n$ independent identically distributed random variables
with negative mean (once again, it is the $y_i$'s that are considered random here). If
$f(y) \equiv p(y|\theta_0)$ for some $\theta_0$, then the terms each have mean $-J(\theta_0)$. If the true data distribution
$f(y)$ is not in the model class, then the mean is $\mathrm{E}_f\left(\frac{d^2}{d\theta^2}\log p(y|\theta)\right)$ evaluated at $\theta = \theta_0$,
which is the negative second derivative of the Kullback-Leibler divergence, $\mathrm{KL}(\theta_0)$, and is
thus negative, because $\theta_0$ is defined as the point at which $\mathrm{KL}(\theta)$ is minimized. Thus, the
coefficient for the second term in the Taylor expansion increases with order $n$. A similar
argument shows that coefficients for the third- and higher-order terms increase no faster
than order $n$.

We can now prove that the posterior distribution approaches normality. As $n \to \infty$,
the mass of the posterior distribution $p(\theta|y)$ becomes concentrated in smaller and smaller
neighborhoods of $\theta_0$, and the distance $|\hat\theta - \theta_0|$ also approaches zero. Thus, in
considering the Taylor expansion about the posterior mode, we can focus on smaller and smaller
neighborhoods about $\hat\theta$. As $|\theta - \hat\theta| \to 0$, the third-order and succeeding terms of the Taylor
expansion fade in importance, relative to the quadratic term, so that the distance between
the quadratic approximation and the log posterior distribution approaches 0, and the normal
approximation becomes increasingly accurate.
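A small numerical sketch may help make the quality of the approximation concrete. The setup below is an assumed example, not one from this appendix: binomial data with $s$ successes in $n$ trials and a uniform prior, so the exact posterior is $\mathrm{Beta}(s + 1, n - s + 1)$, the posterior mode is $s/n$, and the curvature of the log posterior at the mode is available in closed form. The Python function compares the exact posterior standard deviation with the one implied by the quadratic approximation.

```python
import math


def exact_and_approx_sd(n, s):
    """Exact and normal-approximation posterior sd for a binomial probability.

    Assumes a uniform prior and 0 < s < n, so the mode is interior.
    """
    a, b = s + 1, n - s + 1  # posterior is Beta(s + 1, n - s + 1)
    exact = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    mode = s / n
    # Minus the second derivative of s*log(theta) + (n - s)*log(1 - theta) at the mode.
    curvature = s / mode ** 2 + (n - s) / (1 - mode) ** 2
    approx = 1.0 / math.sqrt(curvature)
    return exact, approx
```

Holding the proportion $s/n$ fixed and increasing $n$, the ratio of the two standard deviations should move toward 1, in line with the argument above.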


Multivariate form
If $\theta$ is a vector, the Taylor expansion becomes

$$
\log p(\theta|y) = \log p(\hat\theta|y)
+ \frac{1}{2}(\theta - \hat\theta)^T \left[\frac{d^2}{d\theta^2}\log p(\theta|y)\right]_{\theta=\hat\theta} (\theta - \hat\theta)
+ \cdots,
$$

where the second derivative of the log posterior distribution is now a matrix whose
expectation is the negative of a positive definite matrix which is the Fisher information matrix
(2.20) if $f(y) \equiv p(y|\theta_0)$ for some $\theta_0$.
B.1 Bibliographic note


The asymptotic normality of the posterior distribution was known by Laplace (1810) but 
first proved rigorously by Le Cam (1953); a general survey of previous and subsequent 
theoretical results in this area is given by Le Cam and Yang (1990). Like the central 
limit theorem for sums of random variables, the consistency and asymptotic normality of 
the posterior distribution also hold in far more general conditions than independent and 
identically distributed data. The key condition is that there be ‘replication’ at some level, 
as, for example, if the data come in a time series whose correlations decay to zero. 

The Kullback-Leibler divergence comes from Kullback and Leibler (1951). Chernoff 
(1972, Sections 6 and 9.4) has a clear presentation of consistency and limiting normality 
results for the maximum likelihood estimate. Both proofs can be adapted to the posterior 
distribution. DeGroot (1970, Chapter 10) derives the asymptotic distribution for the pos- 
terior distribution in more detail; Shen and Wasserman (2001) provide more recent results 
in this area. 
Computation in R and Stan


We illustrate some practical issues of simulation by fitting a single example—the hierarchical 
normal model for the eight schools described in Section 5.5. After some background in 
Section C.1, we show in Section C.2 how to fit the model using the Bayesian inference 
package Stan, operating from within the general statistical package R. Sections C.3 and C.4 
present several different ways of programming the model directly in R. These algorithms 
require programming efforts that are unnecessary for the Stan user but are useful knowledge 
for programming more advanced models for which Stan might not work. We conclude in 
Section C.5 with some comments on practical issues of programming and debugging. It 
may also be helpful to read the computational tips in Section 10.7 and the discussion of 
Hamiltonian Monte Carlo and Stan in Sections 12.4-12.6. 


C.1 Getting started with R and Stan 


Go to http://www.r-project.org/ and http://mc-stan.org/. Further information, including
links to help lists, is available at these webpages. We anticipate continuing improvements
in both packages in the years after this book is released, but the general computational
strategies presented here should remain relevant.


R is a general-purpose statistical package that is fully programmable and also has avail- 
able a large range of statistical tools, including flexible graphics, simulation from probability 
distributions, numerical optimization, and automatic fitting of many standard probability 
models including linear regression and generalized linear models. For Bayesian computa- 
tion, one can directly program Gibbs and Metropolis algorithms (as we illustrate in Section 
C.3) or Hamiltonian Monte Carlo (as shown in Section C.4). Computationally intensive 
tasks can be programmed in Fortran or C and linked from R. 


Stan is a high-level language in which the user specifies a model and has the option to 
provide starting values, and then a Markov chain simulation is automatically implemented 
for the resulting posterior distribution. It is possible to set up and fit models entirely 
within Stan, but in practice it is almost always necessary to process data before entering 
them into a model, and to process the inferences after the model is fitted, and so we run 
Stan by calling it from R using the stan() function, as illustrated in Section C.2. Again, 
the details of these function calls might change as Stan continues to be developed, so refer 
to http://mc-stan.org/ for the latest documentation.


When working in R and Stan, it is helpful to set up the computer to simultaneously 
display four windows: the R console, an R graphics window, a text editor with the R script, 
and a text editor with Stan code. Rather than typing directly into R, we prefer to enter 
the R code into the editor and then source the file to run the commands in the R console. 
Using the text editor is convenient because it allows more flexibility in writing functions and 
loops. Another alternative is to use a workspace such as RStudio (http://rstudio.org/) 
which maintains several windows within a single environment. 


591 


This electronic edition is for non-commercial purposes only. 


592 C. COMPUTATION IN R AND STAN 
C.2 Fitting a hierarchical model in Stan 


In this section, we describe all the steps by which we would use Stan to fit the hierarchical 
normal model to the educational testing experiments in Section 5.5. These steps include 
writing the model in Stan and using R to set up the data and starting values, call Stan, 
create predictive simulations, and graph the results. 


Stan program 


The hierarchical model can be written in Stan in the following form, which we save as a 
file, schools.stan, in our working directory: 


data {
  int<lower=0> J;          // number of schools
  real y[J];               // estimated treatment effects
  real<lower=0> sigma[J];  // s.e.'s of effect estimates
}
parameters {
  real mu;                 // population mean
  real<lower=0> tau;       // population sd
  vector[J] eta;           // school-level errors
}
transformed parameters {
  vector[J] theta;         // school effects
  theta <- mu + tau * eta;
}
model {
  eta ~ normal(0, 1);
  y ~ normal(theta, sigma);
}

The first paragraph of the above code specifies the data: the number of schools, J; the
estimates, $y_1, \ldots, y_J$; and the standard errors, $\sigma_1, \ldots, \sigma_J$. Data are labeled as integer or real
and can be vectors (or, more generally, arrays) if dimensions are specified. Data can also
be constrained; for example, in the above model J has been restricted to be nonnegative
and the components of $\sigma$ must all be positive.

The code next introduces the parameters: the unknowns to be estimated in the model
fit. These are the mean, $\mu$, and standard deviation, $\tau$, of the population of school effects;
the school-level errors, $\eta$; and the school effects, $\theta$. In this model, we let $\theta$ be
a transformation of $\mu$, $\tau$, and $\eta$ instead of directly declaring $\theta$ as a parameter. By
parameterizing this way, the sampler runs more efficiently; the resulting multivariate geometry is
better behaved for Hamiltonian Monte Carlo.
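
The equivalence behind this reparameterization can be checked by simulation. The sketch below, with hypothetical values of mu and tau chosen only for illustration, draws theta directly from N(mu, tau) and also builds it as mu + tau*eta with eta ~ N(0,1); the two sets of draws should have matching means and standard deviations.

```r
set.seed(1)
mu <- 8      # hypothetical population mean
tau <- 5     # hypothetical population sd
n <- 1e5     # number of simulation draws

# Direct ("centered") draws of theta
theta_centered <- rnorm(n, mu, tau)

# Draws built from standardized errors, as in the Stan program
eta <- rnorm(n, 0, 1)
theta_noncentered <- mu + tau * eta

print(c(mean(theta_centered), sd(theta_centered)))
print(c(mean(theta_noncentered), sd(theta_noncentered)))
```

The distributions agree; what changes is the set of quantities the sampler moves through (mu, tau, eta rather than mu, tau, theta).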

Finally comes the model, which looks similar to how it would be written in this book. 
(Just be careful: in our book, the second argument to the $N(\cdot,\cdot)$ distribution is the
variance; Stan parameterizes using the standard deviation.) We have written the model in
vector notation, which is cleaner and also runs faster in Stan by making use of more 
efficient autodifferentiation. It would also be possible to write the model more explic- 
itly, for example replacing y ~ normal(theta,sigma); with a loop over the J schools, 
for (j in 1:J) y[j] ~ normal(theta[j],sigma[j]);. 
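
R's dnorm() also takes the standard deviation as its scale argument, so the same caution applies when checking a log likelihood by hand. The following sketch uses only the three data rows visible in the schools.csv listing (schools A, B, and H) and an arbitrary theta vector, which is an assumption for illustration; it compares the vectorized sum, the explicit loop, and the formula written in terms of the variance.

```r
# Rows visible in the schools.csv listing (schools A, B, and H)
y <- c(28, 8, 12)
sigma <- c(15, 10, 18)
theta <- c(10, 8, 9)   # arbitrary values chosen only for illustration

# Vectorized log likelihood; dnorm's third argument is the sd
ll_vec <- sum(dnorm(y, theta, sigma, log = TRUE))

# The same quantity accumulated in an explicit loop over schools
ll_loop <- 0
for (j in seq_along(y))
  ll_loop <- ll_loop + dnorm(y[j], theta[j], sigma[j], log = TRUE)

# Written out with the variance sigma^2, as in the book's N(., .) notation
ll_var <- sum(-0.5 * log(2 * pi * sigma^2) - (y - theta)^2 / (2 * sigma^2))

print(c(ll_vec, ll_loop, ll_var))
```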


R script for data input, starting values, and running Stan 


We put the data into a file, schools.csv, in the R working directory, with headers describ- 
ing the data: 
school, estimate, sd
A, 28, 15
B, 8, 10
...
H, 12, 18
From R, we then execute the following script to read in the data: 


schools <- read.csv("schools.csv", header=TRUE)
J <- nrow(schools)
y <- schools$estimate
sigma <- schools$sd


We load in the rstan package:

library("rstan")


We now run Stan with 4 chains of 1000 iterations each and display the results numerically 
and graphically: 


schools_fit <- stan(file="schools.stan",
  data=c("J","y","sigma"), iter=1000, chains=4)
print(schools_fit)
plot(schools_fit)


When the computations are finished, summaries of the inferences and convergence are 
displayed in the R console (see Figure C.1) and in an R graphics window (not shown here). 

In this example, the sequences appear to have mixed well: the estimated potential scale
reduction factor $\hat{R}$ is below 1.1 for all the parameters and quantities of interest displayed.

Stan uses a stochastic algorithm and so results will not be identical when re-running it.
For example, here is the line of output for the parameter $\theta_1$ in our first Stan run (repeated
from Figure C.1):


         mean se_mean   sd 25%  50%  75% n_eff Rhat
theta[1] 12.1     1.3 11.1 5.8 10.0 15.4    72    1


and here is the corresponding result from the second run:


         mean se_mean   sd 25%  50%  75% n_eff Rhat
theta[1] 11.4     0.3  8.2 5.8 10.4 15.7   830    1


The inferences are similar but not identical. The simulation estimate for $E(\theta_1|y)$ is 12.1
under one simulation and 11.4 under the other, not much of a difference considering that
the posterior standard deviation is about 10 (more precisely, estimated to be 11.1 under
one simulation and 8.2 under the other). The quantiles have the same general feel, but
one must beware of overinterpretation. For example, the 95% posterior interval for $\theta_1$
is [-3.0, 40.1] in one simulation and [-1.9, 31.5] in the other. (The 95% interval can be
obtained from the function call, print(schools_fit,"theta[1]",probs=c(.025,.975));
it is not shown in the default display in Figure C.1.) In practice, the intervals from the two
different simulation runs contain similar information but their variability indicates that,
even after approximate convergence, the tail quantiles of posterior quantities have a fair
amount of simulation variability. The importance of this depends on what the simulation
will be used for.


Inference for Stan model: schools.
4 chains, each with iter=1000; warmup=500; thin=1;
post-warmup draws per chain=500, total post-warmup draws=2000.

          mean se_mean   sd  25%  50%  75% n_eff Rhat
mu         7.4     0.2  4.8  4.5  7.4 10.5   534    1
tau        6.9     0.5  6.1  2.4  5.4  9.3   138    1
eta[1]     0.4     0.1  0.9 -0.2  0.4  1.1   332    1
eta[2]     0.0     0.0  0.9 -0.5  0.1  0.6  1052    1
eta[3]    -0.2     0.0  1.0 -0.9 -0.3  0.4   820    1
eta[4]     0.0     0.0  0.8 -0.5 -0.1  0.5   848    1
eta[5]    -0.3     0.0  0.8 -0.9 -0.3  0.1  1051    1
eta[6]    -0.2     0.0  0.9 -0.8 -0.2  0.4   676    1
eta[7]     0.3     0.0  0.9 -0.2  0.4  1.0   793    1
eta[8]     0.1     0.0  0.9 -0.6  0.1  0.6   902    1
theta[1]  12.1     1.3 11.1  5.8 10.0 15.4    72    1
theta[2]   7.8     0.2  5.9  3.9  7.7 11.6   934    1
theta[3]   4.8     0.5  9.0  1.0  6.2 10.3   301    1
theta[4]   7.0     0.3  6.7  3.0  6.9 11.3   512    1
theta[5]   4.5     0.3  6.4  0.2  5.0  8.9   604    1
theta[6]   5.6     0.6  7.7  1.9  6.5 10.3   142    1
theta[7]  10.5     0.3  7.0  5.4  9.8 14.6   636    1
theta[8]   8.3     0.4  8.2  3.6  8.0 12.8   532    1
lp__      -4.9     0.2  2.6 -6.6 -4.8 -3.0   201    1

Samples were drawn using NUTS2 at Wed Apr 24 13:36:13 2013.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).


Figure C.1 Numerical output from the print() function applied to the Stan code of the hierarchical
model for the educational testing example. For each parameter, mean is the estimated posterior
mean (computed as the average of the saved simulation draws), se_mean is the estimated standard
error (that is, Monte Carlo uncertainty) of the mean of the simulations, and sd is the standard
deviation. Thus, as the number of simulation draws approaches infinity, se_mean approaches zero
while sd approaches the posterior standard deviation of the parameter. Then come several quantiles,
then the effective sample size $n_{\rm eff}$ (formula (11.8) on page 287) and the potential scale reduction
factor $\hat{R}$ (see (11.4) on page 285). When all the simulated chains have mixed, $\hat{R} = 1$. Beyond
this, the effective sample size and standard errors give a sense of whether the simulations suffice
for practical purposes. Each line of the table shows inference for a single scalar parameter in the
model, with the last line displaying inference for the unnormalized log posterior density calculated
at each step in Stan.

Both simulations show good mixing ($\hat{R} \approx 1$), but the effective sample sizes are much
different. This sort of variation is expected, as $n_{\rm eff}$ is itself a random variable estimated from
simulation draws. The simulation with higher effective sample size has a lower standard
error of the mean and more stable estimates.
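
The link between the standard errors and effective sample sizes in the two runs can be checked by hand. The sketch below assumes that se_mean is formed as sd divided by the square root of n_eff, which is an assumption about how the reported Monte Carlo standard error is computed; the variable names are hypothetical, and the numbers are copied from the two output lines for theta[1] shown earlier.

```r
# Values copied from the two reported output lines for theta[1]
sd_run1 <- 11.1; n_eff_run1 <- 72
sd_run2 <- 8.2;  n_eff_run2 <- 830

# Assumed relation: Monte Carlo standard error = sd / sqrt(n_eff)
se_run1 <- sd_run1 / sqrt(n_eff_run1)
se_run2 <- sd_run2 / sqrt(n_eff_run2)

# Rounded to one decimal place for comparison with the se_mean column
print(round(c(se_run1, se_run2), 1))
```

Under that assumption the rounded values agree with the se_mean entries of 1.3 and 0.3 in the two runs.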


Accessing the posterior simulations in R 


The output of the R function stan() is an object from which can be extracted various 
information regarding convergence and performance of the algorithm as well as a matrix of 
simulation draws of all the parameters, following the basic idea of Figure 1.1 on page 24. 




For example: 


schools_sim <- extract(schools_fit)


The result is a list with five elements corresponding to the five quantities saved in the
model: theta, eta, mu, tau, lp__. The vector $\theta$ of length 8 becomes a 2,000 x 8 matrix of
simulations, the vector $\eta$ similarly becomes a 2,000 x 8 matrix, the scalars $\mu$ and $\tau$ each
become a vector of 2,000 draws, and the 2,000 draws of the unnormalized log posterior
density are saved as the fifth element of the list.

For example, we can display posterior inference for $\tau$:


hist (schools_sim$tau) 


Or compute the posterior probability that the effect is larger in school A than in school C: 


mean(schools_sim$theta[,1] > schools_sim$theta[,3]) 
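
Other summaries follow the same pattern of applying a function to the rows or columns of the matrix of draws. The sketch below is not run on Stan output: theta_draws is a synthetic stand-in for schools_sim$theta, generated here only so that the code is self-contained, and its numerical results carry no meaning for the schools example.

```r
set.seed(2)
n_sims <- 2000
J <- 8

# Synthetic stand-in for schools_sim$theta (NOT output from Stan)
theta_draws <- matrix(rnorm(n_sims * J, mean = 8, sd = 6), n_sims, J)

# Central 95% interval for each school: one column per school
intervals <- apply(theta_draws, 2, quantile, probs = c(0.025, 0.975))

# Probability that school 1 has the largest effect among the J schools
p_best <- mean(apply(theta_draws, 1, which.max) == 1)

print(round(intervals, 1))
print(p_best)
```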


Posterior predictive simulations and graphs in R 


Replicated data in the existing schools. Having run Stan to successful convergence, we can
work directly in R with the saved parameters, $\theta$, $\mu$, $\tau$. For example, we can simulate
posterior predictive replicated data in the original 8 schools:


n_sims <- length(schools_sim$lp__)
y_rep <- array(NA, c(n_sims, J))
for (s in 1:n_sims)
  y_rep[s,] <- rnorm(J, schools_sim$theta[s,], sigma)


We now illustrate a graphical posterior predictive check. There are not many ways to
display a set of eight numbers. One possibility is as a histogram; the possible values of $y^{\rm rep}$
are then represented by an array of histograms as in Figure 6.2 on page 144. In R, this
could be programmed as


par(mfrow=c(5,4), mar=c(4,4,2,2))
hist(y, xlab="", main="y")
for (s in 1:19)
  hist(y_rep[s,], xlab="", main=paste("y_rep",s))


The upper-left histogram displays the observed data, and the other 19 histograms are 
posterior predictive replications, which in this example look similar to the data. 

We could also compute a numerical test statistic such as the difference between the best 
and second-best of the 8 coaching programs: 


test <- function(y){
  y_sort <- rev(sort(y))
  return(y_sort[1] - y_sort[2])
}
t_y <- test(y)
t_rep <- rep(NA, n_sims)
for (s in 1:n_sims)
  t_rep[s] <- test(y_rep[s,])


We then can summarize the posterior predictive check. The following R code gives a 
numerical comparison of the test statistic to its replication distribution, a p-value, and a 
graph like those on pages 144 and 148: 
par(mfrow=c(1,1))
cat("T(y) =", round(t_y,1), " and T(y_rep) has mean",
  round(mean(t_rep),1), "and sd", round(sd(t_rep),1),
  "\nPr (T(y_rep) > T(y)) =", round(mean(t_rep>t_y),2), "\n")
hist0 <- hist(t_rep, xlim=range(t_y,t_rep), xlab="T(y_rep)")
lines(rep(t_y,2), c(0,1e6))
text(t_y, .9*max(hist0$counts), "T(y)", adj=0)


Replicated data in new schools. As discussed in Section 6.5, another form of replication
would simulate new parameter values and new data for eight new schools. To simulate data
$y_j \sim N(\theta_j, \sigma_j^2)$ from new schools, it is necessary to make some assumption or model for the
data variances $\sigma_j^2$. For the purpose of illustration, we assume these are repeated from the
original 8 schools.


theta_rep <- array(NA, c(n_sims, J))
y_rep <- array(NA, c(n_sims, J))
for (s in 1:n_sims){
  theta_rep[s,] <- rnorm(J, schools_sim$mu[s], schools_sim$tau[s])
  y_rep[s,] <- rnorm(J, theta_rep[s,], sigma)
}


Numerical and graphical comparisons can be performed as before. 


Alternative prior distributions 


The model as programmed above has nearly uniform prior distributions on the
hyperparameters $\mu$ and $\tau$. An alternative is a half-Cauchy for $\tau$, which we could implement by
taking the Stan model on page 592 and adding the line, tau ~ cauchy(0,25);.

We can fit the model as before. This new hyperprior distribution leads to changed
inferences. In particular, the posterior mean and median of $\tau$ are lower and shrinkage of
the $\theta_j$'s is greater than in the previously fitted model with a uniform prior distribution
on $\tau$. To understand this, it helps to graph the prior density in the range for which the
posterior distribution is substantial. Figure 5.9 on page 131 shows that the prior density is
a decreasing function of $\tau$ which has the effect of shortening the tail of the posterior density.
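
The shape of this prior can be examined directly in R. The sketch below evaluates a half-Cauchy density with scale 25 on a grid from 0 to 40 (the grid and variable names are choices made for this illustration) by doubling dcauchy() on the nonnegative half-line, and compares the density at the two ends of the grid.

```r
# Grid of tau values (hypothetical range chosen for illustration)
tau_grid <- seq(0, 40, length.out = 401)

# Half-Cauchy(0, 25) density: twice the Cauchy density, restricted to tau >= 0
half_cauchy <- 2 * dcauchy(tau_grid, location = 0, scale = 25)

# How much more prior density sits at tau = 0 than at tau = 40
ratio <- half_cauchy[1] / half_cauchy[length(tau_grid)]
print(ratio)

plot(tau_grid, half_cauchy, type = "l", xlab = "tau", ylab = "prior density")
```

The density is proportional to 1/(1 + (tau/25)^2), so it declines steadily over this range, in contrast to a flat prior.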


Using the t model 


It is straightforward to expand the hierarchical normal distribution for the coaching effects 
to a t distribution as discussed in Section 17.4, by replacing eta ~ normal(0,1); with 
eta ~ student_t(nu,0,1); and declaring nu as a parameter that takes on a value of 1 or 
greater (real<lower=1> nu;) and assigning it a prior distribution. 


C.3 Direct simulation, Gibbs, and Metropolis in R 


In this section we demonstrate several different ways to fit the 8-schools model by directly 
programming the computations in R. 


Marginal and conditional simulation for the normal model 


We begin by programming the calculations in Section 5.4. The programs provided here
return to the notation of Chapter 5 (for example, $\tau$ is the population standard deviation
of the $\theta$'s) as this allows for easy identification of some of the variables in the programs
(for example, mu_hat and V_mu are the quantities denoted by the corresponding symbols in
(5.20)).
We assume that the dataset has been read into R as in Section C.2, with J the number
of schools, y the vector of data values, and sigma the vector of standard deviations. Then
the first step of our programming is to set up a grid for $\tau$, evaluate the marginal posterior
distribution (5.21) for $\tau$ at each grid point, and sample 1000 draws from the grid. The
grid here is n_grid=2000 points equally spread from 0.01 to 40. Here we use the grid as a
discrete approximation to the posterior distribution of $\tau$. We first define $\hat{\mu}$ and $V_\mu$ of (5.20)
as functions of $\tau$ and the data, as these quantities are needed here and in later steps, and
then compute the log density for $\tau$.


mu_hat <- function(tau, y, sigma){
  sum(y/(sigma^2 + tau^2))/sum(1/(sigma^2 + tau^2))
}
V_mu <- function(tau, y, sigma){
  1/sum(1/(tau^2 + sigma^2))
}
n_grid <- 2000
tau_grid <- seq(.01, 40, length=n_grid)
log_p_tau <- rep(NA, n_grid)
for (i in 1:n_grid){
  mu <- mu_hat(tau_grid[i], y, sigma)
  V <- V_mu(tau_grid[i], y, sigma)
  log_p_tau[i] <- .5*log(V) - .5*sum(log(sigma^2 + tau_grid[i]^2)) -
    .5*sum((y-mu)^2/(sigma^2 + tau_grid[i]^2))
}


We compute the posterior density for $\tau$ on the log scale and rescale it to eliminate the
possibility of computational overflow or underflow that can occur when multiplying many factors.


log_p_tau <- log_p_tau - max(log_p_tau)
p_tau <- exp(log_p_tau)
p_tau <- p_tau/sum(p_tau)
n_sims <- 1000
tau <- sample(tau_grid, n_sims, replace=TRUE, prob=p_tau)
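
To see why the maximum is subtracted before exponentiating, consider a small example with very negative log densities. The values in log_p below are hypothetical and chosen only to trigger underflow; they are not taken from the schools model.

```r
# Hypothetical unnormalized log densities, far below the range exp() can represent
log_p <- c(-1000, -1001, -1003)

# Naive normalization: every exp() underflows to 0, so the ratio is 0/0
naive <- exp(log_p)
naive_prob <- naive / sum(naive)

# Stable normalization: shift by the maximum so the largest term is exp(0) = 1
stable <- exp(log_p - max(log_p))
stable_prob <- stable / sum(stable)

print(naive_prob)
print(stable_prob)
```

Subtracting a constant on the log scale multiplies every unnormalized density by the same factor, so the normalized probabilities are unchanged while the arithmetic stays within floating-point range.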


The last step draws the simulations of $\tau$ from the approximate discrete distribution. The
remaining steps are sampling from normal conditional distributions for $\mu$ and the $\theta_j$'s as in
Section 5.4. The sampled values of the eight $\theta_j$'s are collected in an array.


mu <- rep(NA, n_sims)
theta <- array(NA, c(n_sims,J))
for (i in 1:n_sims){
  mu[i] <- rnorm(1, mu_hat(tau[i],y,sigma), sqrt(V_mu(tau[i],y,sigma)))
  theta_mean <- (mu[i]/tau[i]^2 + y/sigma^2)/(1/tau[i]^2 + 1/sigma^2)
  theta_sd <- sqrt(1/(1/tau[i]^2 + 1/sigma^2))
  theta[i,] <- rnorm(J, theta_mean, theta_sd)
}


We now have created 1000 draws from the joint posterior distribution of $\tau$, $\mu$, $\theta$. Posterior
predictive distributions are easily generated using the random number generation capabilities
of R as described above in the Stan context.


Gibbs sampler for the normal model 


Another approach, actually simpler to program, is to use the Gibbs sampler. This
computational approach follows the outline of Section 11.6 with the simplification that the
observation variances $\sigma_j^2$ are known.

theta_update <- function(){
  theta_hat <- (mu/tau^2 + y/sigma^2)/(1/tau^2 + 1/sigma^2)
  V_theta <- 1/(1/tau^2 + 1/sigma^2)
  rnorm(J, theta_hat, sqrt(V_theta))
}
mu_update <- function(){
  rnorm(1, mean(theta), tau/sqrt(J))
}
tau_update <- function(){
  sqrt(sum((theta-mu)^2)/rchisq(1,J-1))
}


We now generate five independent Gibbs sampling sequences of length 1000. We
initialize $\mu$ and $\tau$ with overdispersed values based on the range of the data y and then run the
Gibbs sampler, saving the output in a large array, sims, that contains posterior simulation
draws for $\theta$, $\mu$, $\tau$.


chains <- 5
iter <- 1000
sims <- array(NA, c(iter, chains, J+2))
dimnames(sims) <- list(NULL, NULL,
  c(paste("theta[", 1:8, "]", sep=""), "mu", "tau"))
for (m in 1:chains){
  mu <- rnorm(1, mean(y), sd(y))
  tau <- runif(1, 0, sd(y))
  for (t in 1:iter){
    theta <- theta_update()
    mu <- mu_update()
    tau <- tau_update()
    sims[t,m,] <- c(theta, mu, tau)
  }
}


We then check the mixing of the sequences using the R function monitor that carries
out the convergence diagnostic and effective sample size computation described in Section
11.4:


monitor(sims)


The monitor function is part of the rstan package and thus is already loaded if you have
entered library("rstan") in your current R session. The function takes as input an array of
posterior simulations from multiple chains, and it returns an estimate of the potential scale
reduction $\hat{R}$, effective sample size $n_{\rm eff}$, and summary statistics for the posterior distribution
(based on the last half of the simulated Markov chains).
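
For readers who want to see what such a diagnostic involves, here is a minimal base-R sketch of a split-chain potential scale reduction factor for a single scalar quantity. It is not the rstan implementation: the function name split_rhat is hypothetical, the input is assumed to be an iterations x chains matrix with warm-up already discarded, and the test chains are simulated rather than taken from the schools model.

```r
# Split-chain potential scale reduction for one scalar quantity.
# sims: matrix with one row per iteration and one column per chain.
split_rhat <- function(sims) {
  n_iter <- nrow(sims)
  half <- floor(n_iter / 2)
  # Split every chain in two, giving twice as many sequences of length `half`
  first <- sims[1:half, , drop = FALSE]
  second <- sims[(n_iter - half + 1):n_iter, , drop = FALSE]
  split <- cbind(first, second)
  n <- nrow(split)
  B <- n * var(colMeans(split))        # between-sequence variance
  W <- mean(apply(split, 2, var))      # within-sequence variance
  var_plus <- (n - 1) / n * W + B / n  # pooled estimate of the marginal variance
  sqrt(var_plus / W)
}

set.seed(3)
mixed <- matrix(rnorm(1000 * 4), 1000, 4)          # four well-mixed chains
stuck <- mixed + rep(c(0, 0, 3, 3), each = 1000)   # two chains shifted by 3

print(split_rhat(mixed))
print(split_rhat(stuck))
```

The first call should return a value very close to 1, while the second, where two chains sit in a different region, should be well above the 1.1 threshold mentioned earlier.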

The model can also be computed using alternative parameterizations. For example, in 
a parameter-expanded model, the Gibbs sampler steps can be programmed as 


gamma_update <- function(){
  gamma_hat <- (alpha*(y-mu)/sigma^2)/(1/tau^2 + alpha^2/sigma^2)
  V_gamma <- 1/(1/tau^2 + alpha^2/sigma^2)
  rnorm(J, gamma_hat, sqrt(V_gamma))
}
alpha_update <- function(){
  alpha_hat <- sum(gamma*(y-mu)/sigma^2)/sum(gamma^2/sigma^2)
  V_alpha <- 1/sum(gamma^2/sigma^2)
  rnorm(1, alpha_hat, sqrt(V_alpha))
}
mu_update <- function(){
  mu_hat <- sum((y-alpha*gamma)/sigma^2)/sum(1/sigma^2)
  V_mu <- 1/sum(1/sigma^2)
  rnorm(1, mu_hat, sqrt(V_mu))
}
tau_update <- function(){
  sqrt(sum(gamma^2)/rchisq(1,J-1))
}


The Gibbs sampler can then be implemented as 


sims <- array(NA, c(iter, chains, J+2))
dimnames(sims) <- list(NULL, NULL,
  c(paste("theta[", 1:8, "]", sep=""), "mu", "tau"))
for (m in 1:chains){
  alpha <- 1
  mu <- rnorm(1, mean(y), sd(y))
  tau <- runif(1, 0, sd(y))
  for (t in 1:iter){
    gamma <- gamma_update()
    alpha <- alpha_update()
    mu <- mu_update()
    tau <- tau_update()
    sims[t,m,] <- c(mu + alpha*gamma, mu, abs(alpha)*tau)
  }
}
monitor(sims)


Gibbs sampling for the t model with fixed degrees of freedom 


As described in Chapter 17, the t model can be implemented with the Gibbs sampler
using the normal-inverse-$\chi^2$ parameterization for the $\theta_j$'s and their variances. Following
the notation of that chapter, we take $V_j$ to be the variance for $\theta_j$ and model the $V_j$'s as
draws from an inverse-$\chi^2$ distribution with degrees of freedom $\nu$ and scale $\tau$. As with the
normal model, we use a uniform prior distribution on $(\mu, \tau)$.

As before, we first create the separate updating functions, including a new function to
update the individual-school variances $V_j$.


theta_update <- function(){
  theta_hat <- (mu/V + y/sigma^2)/(1/V + 1/sigma^2)
  V_theta <- 1/(1/V + 1/sigma^2)
  rnorm(J, theta_hat, sqrt(V_theta))
}
mu_update <- function(){
  mu_hat <- sum(theta/V)/sum(1/V)
  V_mu <- 1/sum(1/V)
  rnorm(1, mu_hat, sqrt(V_mu))
}
tau_update <- function(){
  sqrt(rgamma(1, J*nu/2+1, (nu/2)*sum(1/V)))
}
V_update <- function(){
  (nu*tau^2 + (theta-mu)^2)/rchisq(J,nu+1)
}


Initially we fix the degrees of freedom at 4 to provide a robust analysis of the data. 


sims <- array(NA, c(iter, chains, J+2))
dimnames(sims) <- list(NULL, NULL,
  c(paste("theta[", 1:8, "]", sep=""), "mu", "tau"))
nu <- 4
for (m in 1:chains){
  mu <- rnorm(1, mean(y), sd(y))
  tau <- runif(1, 0, sd(y))
  V <- runif(J, 0, sd(y))^2
  for (t in 1:iter){
    theta <- theta_update()
    V <- V_update()
    mu <- mu_update()
    tau <- tau_update()
    sims[t,m,] <- c(theta, mu, tau)
  }
}
monitor(sims)


Gibbs-Metropolis sampling for the t model with unknown degrees of freedom 


We can also include $\nu$, the degrees of freedom in the above analysis, as an unknown
parameter and update it conditional on all the others using the Metropolis algorithm. We follow
the discussion in Chapter 17 and use a uniform prior distribution on $(\mu, \tau, 1/\nu)$.

To perform the Metropolis update, we write a function log_post to calculate the
logarithm of the conditional posterior distribution of $1/\nu$ given all of the other parameters.
(We work on the logarithmic scale to avoid computational overflows, as mentioned in
Section 10.7.) The log posterior density function for this model has three terms: the
logarithm of a normal density for the data points $y_j$, the logarithm of a normal density for the
school effects $\theta_j$, and the logarithm of an inverse-$\chi^2$ density for the variances $V_j$. Actually,
only the last term involves $\nu$, but for generality we compute the entire log-posterior density:


log_post <- function(theta, V, mu, tau, nu, y, sigma){
  sum(dnorm(y, theta, sigma, log=TRUE)) +
    sum(dnorm(theta, mu, sqrt(V), log=TRUE)) +
    sum(.5*nu*log(nu/2) + nu*log(tau) -
      lgamma(nu/2) - (nu/2+1)*log(V) - .5*nu*tau^2/V)
}


We introduce the function that performs the Metropolis step and then describe how 
to alter the R code given earlier to incorporate the Metropolis step. The following func- 
tion performs the Metropolis step for the degrees of freedom (recall that we work with 
the reciprocal of the degrees of freedom). The jumping distribution is normal with mean 
at the current value and standard deviation sigma_jump_nu (which is set as described
below). We compute the jumping probability as described on page 278, setting it to zero if the
proposed value of $1/\nu$ is outside the interval (0, 1] to ensure that such proposals are rejected.


nu_update <- function(sigma_jump_nu){
  nu_inv_star <- rnorm(1, 1/nu, sigma_jump_nu)
  if (nu_inv_star<=0 | nu_inv_star>1)
    p_jump <- 0
  else {
    nu_star <- 1/nu_inv_star
    log_post_old <- log_post(theta, V, mu, tau, nu, y, sigma)
    log_post_star <- log_post(theta, V, mu, tau, nu_star, y, sigma)
    r <- exp(log_post_star - log_post_old)
    nu <- ifelse(runif(1) < r, nu_star, nu)
    p_jump <- min(r,1)
  }
  # return() takes a single value, so the two results are wrapped in a list
  return(list(nu=nu, p_jump=p_jump))
}


This updating function stores the acceptance probability p_jump_nu which is used in adap- 
tively setting the jumping scale sigma_jump_nu, as we discuss when describing the Gibbs- 
Metropolis loop. 

Given these functions, it is relatively easy to modify the R code that we have already 
written for the t model with fixed degrees of freedom. When computing the Metropolis up- 
dates, we store the acceptance probabilities in an array, p_jump_nu, to monitor the efficiency 
of the jumping. Theoretical results given in Chapter 11 suggest that for a single parameter 
the optimal acceptance rate—that is, the average probability of successfully jumping—is 
approximately 44%. We can vary sigma_jump_nu in a pilot study to aim for this rate. For
this example we can settle on a value such as sigma_jump_nu=1, which has an average
jumping probability of about 0.4.


sigma_jump_nu <- 1
p_jump_nu <- array(NA, c(iter, chains))
sims <- array(NA, c(iter, chains, J+3))
dimnames(sims) <- list(NULL, NULL,
  c(paste("theta[", 1:8, "]", sep=""), "mu", "tau", "nu"))
for (m in 1:chains){
  mu <- rnorm(1, mean(y), sd(y))
  tau <- runif(1, 0, sd(y))
  V <- runif(J, 0, sd(y))^2
  nu <- 1/runif(1, 0, 1)
  for (t in 1:iter){
    theta <- theta_update()
    V <- V_update()
    mu <- mu_update()
    tau <- tau_update()
    temp <- nu_update(sigma_jump_nu)
    nu <- temp$nu
    p_jump_nu[t,m] <- temp$p_jump
    sims[t,m,] <- c(theta, mu, tau, nu)
  }
}
print(mean(p_jump_nu))
monitor(sims)


Parameter expansion for the t model 


Finally, we can make the computations for the t model more efficient by applying param- 
eter expansion. In the expanded parameterization, the new Gibbs sampler steps can be 
programmed in R as 
gamma_update <- function(){
  gamma_hat <- (alpha*(y-mu)/sigma^2)/(1/V + alpha^2/sigma^2)
  V_gamma <- 1/(1/V + alpha^2/sigma^2)
  rnorm(J, gamma_hat, sqrt(V_gamma))
}
alpha_update <- function(){
  alpha_hat <- sum(gamma*(y-mu)/sigma^2)/sum(gamma^2/sigma^2)
  V_alpha <- 1/sum(gamma^2/sigma^2)
  rnorm(1, alpha_hat, sqrt(V_alpha))
}
mu_update <- function(){
  mu_hat <- sum((y-alpha*gamma)/sigma^2)/sum(1/sigma^2)
  V_mu <- 1/sum(1/sigma^2)
  rnorm(1, mu_hat, sqrt(V_mu))
}
tau_update <- function(){
  sqrt(rgamma(1, J*nu/2+1, (nu/2)*sum(1/V)))
}
V_update <- function(){
  (nu*tau^2 + gamma^2)/rchisq(J,nu+1)
}
nu_update <- function(sigma_jump){
  nu_inv_star <- rnorm(1, 1/nu, sigma_jump)
  if (nu_inv_star<=0 | nu_inv_star>1)
    p_jump <- 0
  else {
    nu_star <- 1/nu_inv_star
    log_post_old <- log_post(mu+alpha*gamma, alpha^2*V, mu,
      abs(alpha)*tau, nu, y, sigma)
    log_post_star <- log_post(mu+alpha*gamma, alpha^2*V, mu,
      abs(alpha)*tau, nu_star, y, sigma)
    r <- exp(log_post_star - log_post_old)
    nu <- ifelse(runif(1) < r, nu_star, nu)
    p_jump <- min(r,1)
  }
  # return() takes a single value, so the two results are wrapped in a list
  return(list(nu=nu, p_jump=p_jump))
}


The posterior density can conveniently be calculated in terms of the original
parameterization, as shown in the function nu_update() above. We can then run the Gibbs-Metropolis
algorithm as before (see the program in the preceding subsection), adding initialization
steps for $\gamma$ and $\alpha$ just before the 'for (t in 1:iter)' loop:


gamma <- rnorm(J, 0, 1) 
alpha <- rnorm(1, 0, 1) 


adding updating steps for $\gamma$ and $\alpha$ inside the loop,


gamma <- gamma_update() 
alpha <- alpha_update() 


and replacing the last line inside the loop with simulations transformed to the original $\theta$, $\mu$, $\tau$
parameterization:


sims[t,m,] <- c(mu+alpha*gamma, mu, abs(alpha)*tau, nu)

We must once again tune the scale of the Metropolis jumps. We started for convenience
at sigma_jump_nu=1, and this time the average jumping probability for the Metropolis
step is 17%. This is lower than the optimal rate of 44% for one-dimensional jumping, and
so we would expect to get a more efficient algorithm by decreasing the scale of the jumps
(see Section 12.2). Reducing sigma_jump_nu to 0.5 yields an average acceptance probability
p_jump_nu of 32%, and sigma_jump_nu=0.3 yields an average jumping probability of 46%
and somewhat more efficient simulations; that is, the draws of $\nu$ from the Gibbs-Metropolis
algorithm are less correlated and yield a more accurate estimate of the posterior distribution.
Decreasing sigma_jump_nu any further would make the acceptance rate too high and
reduce the efficiency of the algorithm.
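
The kind of pilot study described here can be sketched on a toy problem. The code below is not the schools model: it runs a one-dimensional Metropolis sampler on a standard normal target (an assumption chosen so the example is self-contained) and reports the average jumping probability for several jumping scales. The function name metropolis_accept and the particular scales are hypothetical.

```r
# Average jumping probability of random-walk Metropolis on a toy N(0,1) target
metropolis_accept <- function(scale, n_iter = 5000) {
  log_target <- function(x) dnorm(x, 0, 1, log = TRUE)  # toy target density
  x <- 0
  p_jump <- numeric(n_iter)
  for (t in 1:n_iter) {
    x_star <- rnorm(1, x, scale)                     # normal jump centered at x
    r <- exp(log_target(x_star) - log_target(x))     # density ratio
    p_jump[t] <- min(r, 1)
    if (runif(1) < r) x <- x_star                    # accept or keep current value
  }
  mean(p_jump)
}

set.seed(4)
scales <- c(0.5, 2.4, 10)
rates <- sapply(scales, metropolis_accept)
print(round(rates, 2))
```

Small jumps are accepted often but move slowly, and large jumps are rarely accepted; for this toy target the middle scale should land near the 44% figure cited above.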


C.4 Programming Hamiltonian Monte Carlo in R 


We demonstrate Hamiltonian Monte Carlo (HMC) by programming the basic eight-schools 
model. For this particular problem, HMC is overkill but it might help to have this code as 
a template. 

We begin by reading in and setting up the data: 


schools <- read.csv("schools.csv", header=TRUE)
J <- nrow(schools)
y <- schools$estimate
sigma <- schools$sd


Our model has 10 parameters, which we string into a single vector which we label as th = 
(theta[1],...,theta[8],mu,tau). In the HMC program we work with this ten-dimen- 
sional vector, extracting its components as needed. First we program the log posterior 
density: 


log_p_th <- function(th, y, sigma){
  J <- length(th) - 2
  theta <- th[1:J]
  mu <- th[J+1]
  tau <- th[J+2]
  if (is.nan(tau) | tau<=0)
    return(-Inf)
  else{
    log_hyperprior <- 1   # constant: uniform hyperprior on (mu, tau)
    log_prior <- sum(dnorm(theta, mu, tau, log=TRUE))
    log_likelihood <- sum(dnorm(y, theta, sigma, log=TRUE))
    return(log_hyperprior + log_prior + log_likelihood)
  }
}


The scale parameter $\tau$ is restricted under the model to be positive, hence the if-statement
above which has the effect of setting the posterior density to zero if $\tau$ jumps below zero.

Next we program the analytical gradient, the derivative of the log posterior with respect 
to each parameter: 


gradient_th <- function(th, y, sigma){ 
  J <- length(th) - 2 
  theta <- th[1:J] 
  mu <- th[J+1] 
  tau <- th[J+2] 
  if (tau<=0) 
    return(rep(0, length(th)))    # zero gradient, same length as th 
  else { 
    d_theta <- - (theta-y)/sigma^2 - (theta-mu)/tau^2 
    d_mu <- -sum(mu-theta)/tau^2 
    d_tau <- -J/tau + sum((mu-theta)^2)/tau^3 
    return(c(d_theta, d_mu, d_tau)) 
  } 
} 


If τ is less than or equal to zero, we have set the gradient to zero. 
For debugging purposes we also write a numerical gradient function based on first 
differences: 


gradient_th_numerical <- function(th, y, sigma){ 
  d <- length(th) 
  e <- .0001 
  diff <- rep(NA, d) 
  for (k in 1:d){ 
    th_hi <- th 
    th_lo <- th 
    th_hi[k] <- th[k] + e 
    th_lo[k] <- th[k] - e 
    diff[k] <- (log_p_th(th_hi,y,sigma)-log_p_th(th_lo,y,sigma))/(2*e) 
  } 
  return(diff) 
} 
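
A natural use of the numerical gradient is to compare it with the analytical gradient at a test point. The sketch below is one way to run that check. So that it runs on its own, it hard-codes the eight-schools estimates and standard errors (assumed here to match Table 5.2) and repeats compact copies of the functions above, omitting the constant log_hyperprior term. The test point th_test is an arbitrary choice.

```r
# Eight-schools data (assumed to match Table 5.2)
y <- c(28, 8, -3, 7, -1, 1, 18, 12)
sigma <- c(15, 10, 16, 11, 9, 11, 10, 18)

# Compact copies of the log posterior and its analytical gradient
log_p_th <- function(th, y, sigma){
  J <- length(th) - 2
  theta <- th[1:J]; mu <- th[J+1]; tau <- th[J+2]
  if (is.nan(tau) || tau <= 0) return(-Inf)
  sum(dnorm(theta, mu, tau, log=TRUE)) + sum(dnorm(y, theta, sigma, log=TRUE))
}
gradient_th <- function(th, y, sigma){
  J <- length(th) - 2
  theta <- th[1:J]; mu <- th[J+1]; tau <- th[J+2]
  c(-(theta-y)/sigma^2 - (theta-mu)/tau^2,
    -sum(mu-theta)/tau^2,
    -J/tau + sum((mu-theta)^2)/tau^3)
}

# Central-difference gradient, one coordinate at a time
gradient_th_numerical <- function(th, y, sigma, e=1e-4){
  sapply(seq_along(th), function(k){
    th_hi <- th; th_lo <- th
    th_hi[k] <- th[k] + e
    th_lo[k] <- th[k] - e
    (log_p_th(th_hi, y, sigma) - log_p_th(th_lo, y, sigma))/(2*e)
  })
}

# Arbitrary test point with tau > 0: theta = y, mu = 8, tau = 6
th_test <- c(y, 8, 6)
max_abs_diff <- max(abs(gradient_th(th_test, y, sigma) -
                        gradient_th_numerical(th_test, y, sigma)))
print(max_abs_diff)
```

A discrepancy that is not close to zero would point to an error in either the log density or its derivative.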


Next we program a single HMC iteration that takes as inputs the parameter vector θ, the 
data y, σ, the step size ε, the number of leapfrog steps L per iteration, and a diagonal mass 
matrix, expressed as a vector, M: 


hmc_iteration <- function(th, y, sigma, epsilon, L, M) { 
  M_inv <- 1/M 
  d <- length(th) 
  phi <- rnorm(d, 0, sqrt(M))                # draw the momentum 
  th_old <- th 
  log_p_old <- log_p_th(th,y,sigma) - 0.5*sum(M_inv*phi^2) 
  phi <- phi + 0.5*epsilon*gradient_th(th, y, sigma)   # initial half step 
  for (l in 1:L){ 
    th <- th + epsilon*M_inv*phi 
    phi <- phi + (if (l==L) 0.5 else 1)*epsilon*gradient_th(th,y,sigma) 
  } 
  phi <- -phi 
  log_p_star <- log_p_th(th,y,sigma) - 0.5*sum(M_inv*phi^2) 
  r <- exp(log_p_star - log_p_old) 
  if (is.nan(r)) r <- 0 
  p_jump <- min(r,1) 
  th_new <- if (runif(1) < p_jump) th else th_old 
  return(list(th=th_new, p_jump=p_jump)) 
} 


The above function performs the L leapfrog steps of an HMC iteration and returns the 
new value of θ (which is the same as the old value if the trajectory was rejected) and the 
acceptance probability, which can be useful in monitoring the efficiency of the algorithm. 
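
To see the leapfrog pattern in isolation, it can help to run it on a target whose answer is known. The following is a minimal sketch under stated assumptions, not part of the eight-schools program: hmc_step is a hypothetical generic version of the iteration that takes the log density and gradient as function arguments, and the target is a one-dimensional standard normal chosen only for illustration.

```r
# One HMC iteration for a generic target, using the same leapfrog pattern.
# log_p and grad are user-supplied functions; M is a diagonal mass vector.
hmc_step <- function(th, log_p, grad, epsilon, L, M){
  phi <- rnorm(length(th), 0, sqrt(M))
  th_old <- th
  log_p_old <- log_p(th) - 0.5*sum(phi^2/M)
  phi <- phi + 0.5*epsilon*grad(th)            # initial half step for momentum
  for (l in 1:L){
    th <- th + epsilon*phi/M                   # full step for position
    phi <- phi + (if (l==L) 0.5 else 1)*epsilon*grad(th)
  }
  log_p_star <- log_p(th) - 0.5*sum(phi^2/M)
  r <- exp(log_p_star - log_p_old)
  if (is.nan(r)) r <- 0
  p_jump <- min(r, 1)
  list(th=if (runif(1) < p_jump) th else th_old, p_jump=p_jump)
}

# Toy target: standard normal in one dimension
log_p <- function(th) sum(dnorm(th, log=TRUE))
grad <- function(th) -th

set.seed(3)
iter <- 2000
draws <- rep(NA, iter)
p_jump <- rep(NA, iter)
th <- 0
for (t in 1:iter){
  step <- hmc_step(th, log_p, grad, epsilon=0.15, L=10, M=1)
  th <- step$th
  draws[t] <- th
  p_jump[t] <- step$p_jump
}
print(round(c(mean=mean(draws), sd=sd(draws), avg_p_jump=mean(p_jump)), 2))
```

If the iteration is coded correctly, the draws should have mean near 0 and standard deviation near 1, and with this small step size the acceptance probabilities should be close to 1.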
Our next function is a wrapper that runs hmc_iteration and takes several arguments: 
an m x d matrix of starting values (corresponding to m sequences and a vector of d 
parameters to start each chain); the number of iterations to run each chain; the baseline step 
size ε_0 and number of steps L_0; and the mass vector M. After setting up empty arrays to 
store the results, our HMC function, hmc_run, runs the chains one at a time. Within each 
sequence, ε and L are randomly drawn at each iteration in order to mix up the algorithm 
and give it the opportunity to explore differently curved areas of the joint distribution. At 
the end, inferences are obtained using the last halves of the simulated sequences, summaries 
are printed out, and the simulations and acceptance probabilities are returned: 


hmc_run <- function(starting_values, iter, epsilon_0, L_0, M) { 
  # monitor() is from the rstan package; fround() is from the arm package 
  chains <- nrow(starting_values) 
  d <- ncol(starting_values) 
  sims <- array(NA, c(iter, chains, d), 
    dimnames=list(NULL, NULL, colnames(starting_values))) 
  warmup <- 0.5*iter 
  p_jump <- array(NA, c(iter, chains)) 
  for (j in 1:chains){ 
    th <- starting_values[j,] 
    for (t in 1:iter){ 
      epsilon <- runif(1, 0, 2*epsilon_0) 
      L <- ceiling(2*L_0*runif(1)) 
      temp <- hmc_iteration(th, y, sigma, epsilon, L, M) 
      p_jump[t,j] <- temp$p_jump 
      sims[t,j,] <- temp$th 
      th <- temp$th 
    } 
  } 
  monitor(sims, warmup) 
  cat("Avg acceptance probs:", 
    fround(colMeans(p_jump[(warmup+1):iter,]),2),"\n") 
  return(list(sims=sims, p_jump=p_jump)) 
} 
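
The random draws of the step size and number of steps inside the loop are easy to inspect on their own. The short sketch below, using illustrative baseline values of 0.1 and 10, simulates many such draws: the step size is uniform on (0, 2*epsilon_0), and the number of steps is uniform on the integers 1,...,2*L_0, so their averages should be close to epsilon_0 and L_0 + 0.5.

```r
set.seed(4)
epsilon_0 <- 0.1
L_0 <- 10
n <- 10000
epsilon <- runif(n, 0, 2*epsilon_0)     # step sizes, uniform on (0, 2*epsilon_0)
L <- ceiling(2*L_0*runif(n))            # numbers of steps, integers 1,...,2*L_0
print(range(L))
print(round(c(mean_epsilon=mean(epsilon), mean_L=mean(L),
              mean_path_length=mean(epsilon*L)), 2))
```

The product epsilon*L is the length of each simulated trajectory, so randomizing both quantities varies the trajectory length from one iteration to the next.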


Now it is time to get ready to run the algorithm. We define a vector with the names of the 
parameters and set the number of chains to 4: 


parameter_names <- c(paste("theta[",1:8,"]",sep=""), "mu", "tau") 
d <- length(parameter_names) 
chains <- 4 


Next we define a diagonal mass matrix to be on the rough scale of the inverse variance 
matrix of the posterior distribution. Given the estimates and standard errors in Table 5.2 
on page 120, we can crudely approximate this scale as 15 for each of the parameters; thus, 


mass_vector <- rep(1/15^2, d) 


Then we set up an array of random starting points defined on roughly the same scale, being 
careful to restrict the starting points for τ to be positive: 


starts <- array(NA, c(chains,d), dimnames=list(NULL, parameter_names)) 
for (j in 1:chains){ 
  starts[j,] <- rnorm(d,0,15) 
  starts[j,10] <- runif(1,0,15) 
} 


We are finally ready to go! We start with our default values ε_0 = 0.1, L_0 = 10, running for 
only 20 iterations to make sure the program does not crash: 


M1 <- hmc_run(starting_values=starts, iter=20, 
  epsilon_0=.1, L_0=10, M=mass_vector) 


The program runs fine so we go back for 100 iterations: 


M2 <- hmc_run(starting_values=starts, iter=100, 
epsilon_0=.1, L_0=10, M=mass_vector) 


Here are the results: 


Inference for the input samples (4 chains: each with iter=100; warmup=50): 


          mean se_mean   sd  25%  50%  75% n_eff Rhat 
theta[1]   9.8     2.4  7.0  3.6  9.7 14.6     9  1.4 
theta[2]   6.9     0.9  5.6  2.3  8.0 10.4    37  1.1 
theta[3]   6.7     1.0  6.8  2.0  6.5 10.7    44  1.1 
theta[4]   8.3     1.4  7.1  4.0  9.4 11.5    25  1.1 
theta[5]   4.9     1.6  5.4  2.6  5.6  8.7    12  1.1 
theta[6]   4.5     0.8  4.7  2.0  5.0  8.1    36  1.1 
theta[7]   9.6     1.1  6.3  4.9  9.3 11.9    30  1.1 
theta[8]   9.2     1.4  7.2  5.5  9.4 13.6    28  1.1 
mu         7.4     0.7  3.9  4.4  7.2 10.2    27  1.1 
tau        6.9     1.8  4.4  4.0  6.1  8.2     6  1.5 


For each parameter, n_eff is a crude measure of effective sample size, 
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1). 

Avg acceptance probs: 0.27 0.42 0.59 0.68 


The acceptance rates seem low (recall our goal of approximately 65% acceptances), and so 
we decrease the base step size from 0.1 to 0.05 and increase the base number of steps from 
10 to 20: 


M3 <- hmc_run(starting_values=starts, iter=100, 
epsilon_0=.05, L_0=20, M=mass_vector) 


This looks better: 


Inference for the input samples (4 chains: each with iter=100; warmup=50): 


          mean se_mean   sd  25%  50%  75% n_eff Rhat 
theta[1]  16.5     2.7 11.5  9.1 14.6 23.3    16  1.1 
theta[2]   9.2     0.7  7.7  5.4  8.9 14.0   110  1.0 
theta[3]   6.6     0.8  8.4  1.9  6.2 12.4    99  1.0 
theta[4]   8.1     0.9  8.8  3.7  7.2 13.5    96  1.0 
theta[5]   3.7     0.8  7.3 -2.1  3.5  8.9    75  1.0 
theta[6]   6.2     2.2  7.7  2.1  5.9 11.4    13  1.4 
theta[7]  12.9     1.1  8.8  7.2 13.2 19.1    61  1.1 
theta[8]   9.4     2.9  9.2  4.3  9.2 15.1    10  1.1 
mu         8.7     0.8  5.4  5.0  8.0 12.5    49  1.1 
tau       10.0     1.9  6.3  6.1  8.4 12.6    11  1.3 


For each parameter, n_eff is a crude measure of effective sample size, 
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1). 

Avg acceptance probs: 0.81 0.75 0.59 0.82 


We re-run for 1000 and then 10,000 iterations and obtain stable inferences: 


Inference for the input samples (4 chains: each with iter=10000; warmup=5000): 


          mean se_mean   sd  25%  50%  75% n_eff Rhat 
theta[1]  11.5     0.3  8.5  5.8 10.4 15.8  1129    1 
theta[2]   7.9     0.2  6.5  3.7  7.8 12.1  1853    1 
theta[3]   6.1     0.2  8.0  2.0  6.5 11.0  2434    1 
theta[4]   7.6     0.2  6.8  3.4  7.6 11.8  1907    1 
theta[5]   4.8     0.2  6.5  1.1  5.2  9.1  1492    1 
theta[6]   6.1     0.1  6.8  2.1  6.3 10.4  2364    1 
theta[7]  10.9     0.2  7.1  6.0 10.3 15.0  1161    1 
theta[8]   8.5     0.2  8.1  3.6  8.1 13.1  1778    1 
mu         8.0     0.2  5.4  4.4  7.9 11.3  1226    1 
tau        6.9     0.2  5.5  2.9  5.6  9.4   565    1 


For each parameter, n_eff is a crude measure of effective sample size, 
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1). 

Avg acceptance probs: 0.57 0.62 0.62 0.66 


For this little hierarchical model, setting up and running HMC was costly both in 
programming effort and computation time, compared to the Gibbs sampler. More generally, though, 
Hamiltonian Monte Carlo can work in complicated problems where Gibbs and Metropolis 
fail, which is why in Stan we implemented HMC (via the no-U-turn sampler). In addition, 
this sort of hierarchical model exhibits better HMC convergence when parameterized in 
terms of the group-level errors (that is, the vector η, where θ_j = μ + τ η_j for j = 1,...,J), 
as demonstrated in the Stan program in Section C.2. 
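
In R, the reparameterization amounts to rewriting the log posterior in terms of (η, μ, τ). The following is a minimal sketch, not the Stan program of Section C.2: the function log_p_eta and the test values are hypothetical, the data are hard-coded (assumed to match Table 5.2), and the constant hyperprior term is omitted. The final lines check that the two parameterizations agree up to the log Jacobian of the transformation, J*log(tau).

```r
# Eight-schools data (assumed to match Table 5.2)
y <- c(28, 8, -3, 7, -1, 1, 18, 12)
sigma <- c(15, 10, 16, 11, 9, 11, 10, 18)

# Log posterior in the original parameterization, th = (theta, mu, tau)
log_p_th <- function(th, y, sigma){
  J <- length(th) - 2
  theta <- th[1:J]; mu <- th[J+1]; tau <- th[J+2]
  if (is.nan(tau) || tau <= 0) return(-Inf)
  sum(dnorm(theta, mu, tau, log=TRUE)) + sum(dnorm(y, theta, sigma, log=TRUE))
}

# Log posterior in terms of group-level errors, ph = (eta, mu, tau),
# where theta[j] = mu + tau*eta[j] and eta[j] ~ N(0,1) a priori
log_p_eta <- function(ph, y, sigma){
  J <- length(ph) - 2
  eta <- ph[1:J]; mu <- ph[J+1]; tau <- ph[J+2]
  if (is.nan(tau) || tau <= 0) return(-Inf)
  theta <- mu + tau*eta
  sum(dnorm(eta, 0, 1, log=TRUE)) + sum(dnorm(y, theta, sigma, log=TRUE))
}

# Consistency check at an arbitrary point
eta <- c(0.5, -0.2, 1.1, 0, -1.3, 0.4, 0.9, -0.6)
mu <- 8
tau <- 6
lp_eta <- log_p_eta(c(eta, mu, tau), y, sigma)
lp_th <- log_p_th(c(mu + tau*eta, mu, tau), y, sigma)
jacobian_check <- lp_eta - (lp_th + length(eta)*log(tau))
print(jacobian_check)
```

The gradient for use in HMC would need to be rederived for the new parameterization; comparing it to a numerical gradient is a sensible check.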


C.5 Further comments on computation 


We have already given general computational tips in Section 10.7: start by computing with 
simple models and compare to previous inferences as complexity is added. We also 
recommend getting started with smaller or simplified datasets, but this strategy was not 
really relevant to the current example with only eight data points. 

There are various ways in which the programs in this appendix could be made more 
computationally efficient. For example, in the Metropolis updating function nu_update() 
for the t degrees of freedom in Section C.3, the log posterior density can be saved so that 
it does not need to be calculated twice at each step. It would also probably be good to 
use a more structured programming style in our R code (for example, in our updating 
functions mu_update(), tau_update(), and so forth) and perhaps to store the parameters 
and data as lists and pass them directly to the functions. We expect that there are many 
other ways in which our programs could be improved. Our general approach is to start 
with transparent (and possibly inefficient) code and then reprogram more efficiently once 
we know it is working. 
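
As an illustration of the first suggestion, a Metropolis update can receive the current log density as an argument and return the updated value, so that only the proposal is evaluated at each step. The sketch below is hypothetical and is not the nu_update() function of Section C.3: the target log_post is a toy t density with 4 degrees of freedom standing in for a log posterior, and the counter n_evals is there only to show how many evaluations are made.

```r
# Toy 1-D target standing in for a log posterior (hypothetical);
# the counter records how many times it is evaluated.
n_evals <- 0
log_post <- function(x){
  n_evals <<- n_evals + 1
  dt(x, df=4, log=TRUE)
}

# Metropolis update that takes and returns the current log density,
# so each iteration evaluates log_post once, at the proposal only
metropolis_update <- function(x, log_p_x, sigma_jump){
  x_star <- rnorm(1, x, sigma_jump)
  log_p_star <- log_post(x_star)
  if (log(runif(1)) < log_p_star - log_p_x)
    list(x=x_star, log_p=log_p_star)
  else
    list(x=x, log_p=log_p_x)
}

set.seed(5)
iter <- 1000
state <- list(x=0, log_p=log_post(0))   # one evaluation at the starting point
for (t in 1:iter)
  state <- metropolis_update(state$x, state$log_p, sigma_jump=2)
print(n_evals)                          # iter + 1 evaluations in total
```

Returning the state as a list also fits the suggestion above of passing parameters to the updating functions as lists.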

We made several mistakes in the process of implementing the computations described 
in this appendix. Simplest were syntax errors in Stan programs and related problems such 
as feeding in the wrong inputs when calling the stan() function from R. 

We fixed syntax errors and other minor problems in the R code by cutting and pasting 
to run the scripts one line at a time, and by inserting print statements inside the R functions 
to display intermediate values. 

We debugged the Stan and R programs in this appendix by comparing them against 
each other, and by comparing each model to previously fitted simpler models. We found 
many errors, including treating variances as standard deviations (for example, the 
expression rnorm(1,alpha_hat,V_alpha) instead of rnorm(1,alpha_hat,sqrt(V_alpha)) when 
simulating from a normal distribution in R), confusion between ν and 1/ν, forgetting a term 
in the log-posterior density, miscalculating the Metropolis updating condition, and saving 
the wrong output in the sims array in the Gibbs sampling loop. 
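
The variance-versus-standard-deviation error is easy to expose by simulation. In the sketch below, V_alpha = 4 is a hypothetical value chosen so that the two scales are clearly different. The third argument of rnorm() is a standard deviation, so passing the variance inflates the spread of the draws.

```r
set.seed(6)
V_alpha <- 4                            # a variance (hypothetical value)
n <- 100000
wrong <- rnorm(n, 0, V_alpha)           # treats the variance as a standard deviation
right <- rnorm(n, 0, sqrt(V_alpha))     # rnorm() expects a standard deviation
print(round(c(sd_wrong=sd(wrong), sd_right=sd(right)), 2))
```

The first set of draws should have a standard deviation near 4 rather than the intended 2, a mismatch of the kind that shows up when simulations are compared against those from an independently written program.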

More serious conceptual errors included a poor choice of prior distribution, which we 
realized was a problem by comparing to posterior simulations computed using a different 
algorithm. We also originally had an error in the programming of a reparameterized model, 
a mistake we discovered because the inferences differed dramatically from the simpler pa- 
rameterization. 

As the examples in this appendix illustrate, Bayesian computation is not always easy, 
even for relatively simple models. However, once a model has been debugged, it can be 
applied and then generalized to work for a range of problems. Ultimately, we find Bayesian 
simulation to be a flexible tool for fitting realistic models to simple and complex data 
structures, and the steps required for debugging are often parallel to the steps required to 
build confidence in a model. We can use R to graphically display posterior inferences and 
predictive checks. 


C.6 Bibliographic note 


R is available at R Project (2002), and its parent software package S is described by Becker, 
Chambers, and Wilks (1988). Two statistics texts that use R extensively are Fox (2002) 
and Venables and Ripley (2002). For more on Stan, see Stan Development Team (2012). R 
and Stan have online documentation, and their websites have pointers to various help files 
and examples. 
in Stan and R, 607-608 
decision analysis, 12, 26, 99, 237-258 
and Bayesian inference, 237-239 
medical screening example, 245-246 
personal and institutional perspec- 
tives, 256 
radon example, 246-256 
survey incentives example, 239-244 
utility, 238, 245, 248, 256 
decision trees, 238, 245, 252 
degrees of freedom, 43, 437, 442 
delta method, 99 
density regression, 568-571 
dependent Dirichlet process (DDP), 562- 
564, 572 
derivatives, computation of, 313 
design of surveys, experiments, and ob- 
servational studies, 197—236 
designs that ‘cheat’, 219 
deviance, 192 
deviance information criterion (DIC), 172- 
173, 177, 192 
discussion, 182 
educational testing example, 179 
differences between data and population, 
207, 221, 223, 237, 422 
differential equation model in toxicology, 
477—485 
dilution assay, example of a nonlinear 
model, 471—476, 485 
dimensionality, curse of, 495 
Dirichlet distribution, 69, 580, 585 
Dirichlet process, 545-573 
Dirichlet process mixtures, 549-557 
discrepancy measure, 145 
discrete data 
adapting continuous models, 458 
latent-data formulation, 408 
logistic regression, 406 
multinomial models, 423—428 
Poisson regression, 406 
probit regression, 406 
discrete probability updating, 9, 245 
dispersion parameter for generalized lin- 
ear models, 405 
distinct parameters and ignorability, 202 
distribution, 577—586 
Bernoulli, 586 
beta, 30, 34, 60, 580, 584 
beta-binomial, 60, 580, 586 
binomial, 580, 585 
Cauchy, 98 
χ², 578, 583 
Dirichlet, 69, 580, 585 
double exponential, 368, 493, 578 
exponential, 578, 583 
gamma, 45, 578, 583 
Gaussian, see normal distribution 
inverse-χ², 578, 583 
inverse-gamma, 43, 578, 583 
inverse-Wishart, 72, 578, 584 
Laplace, 368, 493, 578 
LKJ correlation, 578, 584 
log-logistic, 580 
logistic, 580 
lognormal, 578, 582 
long-tailed, 435 
multinomial, 580, 586 
multivariate normal, 79, 578, 582 
marginals and conditionals, 582 
multivariate t, 319, 580 
negative binomial, 44, 132, 580, 586 
normal, 577, 578 
normal-inverse-χ², 67, 82 
Pareto, 493 
Poisson, 580, 585 
scaled inverse-χ², 43, 65, 578, 583 
t, 66, 580, 584 
uniform, 577, 578 
Weibull, 578, 583 
Wishart, 578, 584 
divorce rates, 105, 135 
dog metabolism example, 380 
dose-response relation, 74 
double exponential distribution, 368, 493, 
578 


E_old, 320 
ECM/ECME algorithms, 323, 348 
educational testing experiments, see SAT 
coaching experiments 
effective number of parameters, 169-182 
educational testing example, 179 
effective number of simulation draws, 286- 
288 
efficiency, 91 
eight schools, see SAT coaching experi- 
ments 
elections 
forecasting presidential elections, 165- 
166, 171-172, 383-388 
incumbency in U.S. Congress, 358- 
364 
polling in Slovenia, 463—466 
polling in U.S., 422-423, 456-462 
probability of a tie, 27 
EM algorithm, 320-325 
AECM algorithm, 324 
as special case of variational infer- 
ence, 337 
debugging, 321 
ECM/ECME algorithms, 323, 348 
for missing-data models, 452 
parameter expansion, 325, 348 
SEM/SECM algorithms, 324-325 
empirical Bayes, why we prefer to avoid 
the term, 104 
environmental health 
allergen measurements, 472 
perchloroethylene, 477 
radon, 246 
EP, see expectation propagation 
estimands, 4, 24, 267 
exchangeable models, 5, 26, 104-108, 230 
and explanatory variables, 5 
and ignorability, 230 
no conflict with robustness, 436 
objections to, 107, 126 
universal applicability of, 107 
expectation propagation, 338-343 
cavity distribution, 339 
extensions, 343 
logistic regression example, 340-343 
moment matching, 339 
picture of, 342 
tilted distribution, 339 
experiments, 214-220 
completely randomized, 214-216 
definition, 214 
distinguished from observational stud- 
ies, 220 
Latin square, 216 
randomization, 218-220 
randomized block, 216 
sequential, 217, 235 
explanatory variables, see regression mod- 
els 
exponential distribution, 578, 583 
exponential families, 36, 338 
exponential model, 46, 60 
external validation, 142, 167 
record linkage example, 17 
toxicology example, 484 


factorial analysis, internet example, 397— 
398 
Federalist papers, 447 
finite-population inference, 200-203, 205- 
209, 212, 214-216, 232 
in Anova, 396-397 
Fisher information, 88 
fixed effects, 383 
and finite population in Anova, 397 
football point spreads, 13-16, 26, 27 
forecasting presidential elections, 142, 383- 
388 
hierarchical model, 386 
problems with ordinary linear re- 
gression, 385 
frequency evaluations, 91—92, 98 
frequentist perspective, 91 
functional data analysis, 512-513 


gamma distribution, 45, 578, 583 
Gaussian distribution, see normal distri- 
bution 
Gaussian processes, 501-518 
birthdays example, 505-510 
golf putting, 517 
latent, 510-512 
logistic, 513-515 
gay marriage data, 499 
generalized linear models, 405-434 
computation, 409-412 
hierarchical, 409 
hierarchical logistic regression, 422— 
423 
hierarchical Poisson regression, 420— 
422 
overdispersion, 407, 431, 433 
prior distribution, 409 
simple logistic regression example, 
74-78 
genetics, 8, 183 
simple example of Bayesian infer- 
ence, 8-9, 27 
geometric mean (GM), 6 
geometric standard deviation (GSD), 6 
Gibbs sampler, 276-278, 280-281, 291 
all-at-once for hierarchical regression, 
393 
assessing convergence, 281-286 
blockwise for hierarchical regression, 
392 
efficiency, 293-295 
examples, 289, 440, 465, 528 
hierarchical linear models, 288-290, 
392-394, 396 
parameter expansion for hierarchi- 
cal regression, 393, 396 
picture of, 277 
programming in R, 597-608 
special case of Metropolis-Hastings 
algorithm, 281 
girl births, proportion of, 37-39 
global mode, why it is not special, 311 
GM (geometric mean), 6 
golf putting 
Gaussian process, 517 
nonlinear model for, 486, 499 
goodness-of-fit testing, see model check- 
ing 
graphical models, 133 
graphics 
examples of use in model checking, 
143, 144, 154-158 
jittering, 14, 15, 27 
posterior predictive checks, 153-159 
grid approximation, 76-77, 263 
GSD (geometric standard deviation), 6 


Hamiltonian (hybrid) Monte Carlo, 300- 
307, 603-607 
for hierarchical model, 305-307, 603- 
607 
leapfrog algorithm, 301 
mass matrix, 301 
momentum distribution, 301 
no U-turn sampler, 304 
programming in R, 603-607 
tuning, 303 
heteroscedasticity in linear regression, 369— 
376 
parametric model for, 372 
hierarchical Dirichlet process (HDP), 564— 
566 
hierarchical linear regression, 381—404 
computation, 392-394, 396 
interpretation as a single linear re- 
gression, 389 
hierarchical logistic regression, 422—423 
hierarchical models, 5, 101-137, 381—404 
analysis of variance (Anova), 395 
binomial, 109-113, 136 
bivariate normal, 209-210 
business school grades, 391-392 
cluster sampling, 210-212 
computation, 108-113 
forecasting elections, 383-388 
logistic regression, 422—423 
many batches of random effects 
election forecasting example, 386 
polling example, 422-423 
meta-analysis, 124-128, 423-425 
multivariate, 390-392, 423-425, 456- 
462 
no unique way to set up, 389 
normal, 113-128, 288-290, 326-330 
NYPD stops, 420-422 
pharmacokinetics example, 480—481 
Poisson, 137, 420-422 
pre-election polling, 209-210 
prediction, 108, 118 
prior distribution, see hyperprior dis- 
tribution 
radon, 246-256 
rat tumor example, 109-113 
SAT coaching, 119-124 
schizophrenia example, 524-533 
stratified sampling, 209-210 
survey incentives, 239-244 
hierarchical Poisson regression, 420—422 
hierarchical regression, 381—404 
prediction, 387 
highest posterior density interval, 33, 56, 
60 
HMC, see Hamiltonian Monte Carlo 
horseshoe prior distribution for regres- 
sion coefficients, 378 
hybrid Monte Carlo, see Hamiltonian Monte 
Carlo 
hyperparameter, 34, 101, 105 
hyperprior distribution, 107—108 
informative, 480—481 
noninformative, 108, 110, 111, 115, 
117, 135, 424, 526 
hypothesis testing, 145, 150 


identifiability, 365 
ignorability, 202-205, 230, 450 
and exchangeability, 230 
incumbency example, 359 
strong, 203 
ignorable and known designs, 203 
ignorable and known designs given co- 
variates, 203 
ignorable and unknown designs, 204 
iid (independent, identically distributed), 
5 
ill-posed systems 
differential equation model in toxi- 
cology, 477—485 
mixture of exponentials, 486 
importance ratio, 264 
importance resampling (sampling-importance 
resampling, SIR), 266, 271, 
273, 319 
examples, 441, 442 
why you should sample without re- 
placement, 266 
importance sampling, 265, 271 
bridge sampling, 347, 348 
for marginal posterior densities, 440 
path sampling, 347-348 
unreliability of, 265 
improper posterior distribution, see pos- 
terior distribution 
improper prior distribution, see prior dis- 
tribution 
imputation, see multiple imputation 
inclusion indicator, 200, 449 
incumbency advantage, 358-364 
two variance parameters, 374 
indicator variables, 366 
for mixture models, 519 
inference 
discrete examples, 8-11 
one of the three steps of Bayesian 
data analysis, 3 
inference, finite-population and superpop- 
ulation, 201-202, 212, 214 
completely randomized experiments, 
215-216, 232 
in Anova, 396-397 
pre-election polling, 208-209 
simple random sampling, 205-206 
information criteria, 169-182 
information matrix, 84, 88 
informative prior distribution 
alternative to selecting regression vari- 
ables, 367-369 
spell checking example, 10 
toxicology example, 480 
institutional decision analysis, 256 
instrumental variables, 224 
integrated nested Laplace approximation 
(INLA), 343 
intention-to-treat effect, 224 
interactions 
in basis-function models, 497 
in Gaussian processes, 504, 511 
in loglinear models, 429 
in regression models, 242, 367 
internet connect times, 397-398 
intraclass correlation, 382 
inverse cdf for posterior simulation, 23 
inverse probability, 56 
inverse-χ² distribution, 578, 583 
inverse-gamma distribution, 43, 578, 583 
inverse-Wishart distribution, 72, 578, 584 
iterative proportional fitting (IPF), 430- 
431 
iterative simulation, see Markov chain 
Monte Carlo, 293-310 
iterative weighted least squares (EM for 
robust regression), 444 


jackknife, 96 

Jacobian, 22 

Jeffreys’ rule for noninformative prior dis- 
tributions, 52-53, 57, 59 

jittering, 14, 15, 27 

joint posterior distribution, 63 


Kullback-Leibler divergence, 88, 331-336, 
587-589 
connection to deviance, 192 


label switching in mixture models, 533 
Laplace distribution, 368, 493, 578 
Laplace’s method for numerical integra- 
tion, 318, 348 
large-sample inference, 83-92 
lasso (regularized regression), 368-369, 
379 
latent continuous models for discrete data, 
408 
latent-variable regression, 515 
Latin square experiment, 216-217 
LD50, 77-78 
leapfrog algorithm for Hamiltonian Monte 
Carlo, 301 
leave-one-out cross-validation, 175-177 
discussion, 182 
educational testing example, 179 
life expectancy, quality-adjusted, 245 
likelihood principle 
model checking and, 152 
likelihood, 7-10 
complete-data, 200 
observed-data, 201 
likelihood principle, 7, 26 
misplaced appeal to, 198 
linear regression, 353-380, see also re- 
gression models 
t errors, 444—445 
analysis of residuals, 361 
classical, 354 
conjugate prior distribution, 376- 
378 
as augmented data, 377 
correlated errors, 369-376 
errors in x and y, 379, 380 
heteroscedasticity, 369-376 
parametric model for, 372 
hierarchical, 381—404 
interpretation as a single linear 
regression, 389 
incumbency example, 358-364 
known covariance matrix, 370 
model checking, 361 
posterior simulation, 356 
prediction, 357, 364 
with correlations, 371 
residuals, 358, 362 
robust, 444—445 
several variance parameters, 369- 
376 
weighted, 372 
link function, 405, 407 
LKJ correlation distribution, 578, 584 
location and scale parameters, 53 
log densities, 261 
log-logistic distribution, 580 
logistic distribution, 580 
logistic regression, 74-78, 406 
for multinomial data, 423 
hierarchical, 422—423 
latent-data interpretation, 408 
logit (logistic, log-odds) transformation, 
22, 125 
loglinear models, 428-431 
prior distributions, 429 
lognormal distribution, 578, 582 
longitudinal data 
survey of adolescent smoking, 211- 
212 


maps 
artifacts in, 46-51, 57 
cancer rates, 46-51 
for model checking, 143 
MAR (missing at random), 202, 450 
a more reasonable assumption than 
MCAR, 450 
marginal and conditional means and vari- 
ances, 21 
marginal posterior distribution, 63, 110, 
111, 122, 261 
approximation, 325-326 
computation for the educational test- 
ing example, 596-597 
computation using importance sam- 
pling, 440 
EM algorithm, 320-325 
marginal predictive checks, 152 
Markov chain, 275 
Markov chain Monte Carlo (MCMC), 275- 
310 
adaptive algorithms, 297 
assessing convergence, 281-286 
between/within variances, 283 
simple example, 285 
auxiliary variables, 297-299, 309 
burn-in, why we prefer the term warm- 
up, 282 
data augmentation, 293 
effective number of simulation draws, 
286-288 
efficiency, 280, 293-296 
Gibbs sampler, 276-278, 280-281, 
291 
assessing convergence, 281—286 
efficiency, 293-295 
examples, 277, 289, 392, 440, 465, 
528 
picture of, 277 
programming in R, 597-608 
Hamiltonian (hybrid) Monte Carlo, 
300-307, 309 
for hierarchical model, 305-307 
leapfrog algorithm, 301 
mass matrix, 301 
momentum distribution, 301 
no U-turn sampler, 304 
tuning, 303 
inference, 281-286 
Metropolis algorithm, 278-280, 291 
efficient jumping rules, 295-297 
examples, 278, 290 
generalizations, 293-300 
picture of, 276 
programming in R, 600-601 
relation to optimization, 279 
Metropolis-Hastings algorithm, 279, 
291 
generalizations, 293-300 
multiple sequences, 282 
output analysis, 281-288 
overdispersed starting points, 283 
parallel tempering, 299-300 
perfect simulation, 309 
regeneration, 309 
reversible jump sampling, 297-299, 
309 
simulated tempering, 309 
slice sampling, 297, 309 
thinning, 282 
trans-dimensional, 297—299, 309 
warm-up, 282 
matrix and vector notation, 4 
maximum entropy, 57 
maximum likelihood, 93 
MCAR (missing completely at random), 
450 
measurement error models 
hierarchical, 133 
linear regression with errors in x and 
y, 380 
nonlinear, 471—476 
medical screening, example of decision 
analysis, 245-246 
meta-analysis, 133, 137 
beta-blockers study, 124-128, 423- 
425 
bivariate model, 423-425 
goals of, 125 
survey incentives study, 239-242 
Metropolis algorithm, 278-280, 291 
efficient jumping rules, 295-297 
examples, 278, 290 
generalizations, 293-300 
picture of, 276 
programming in R, 600-601 
relation to optimization, 279 
Metropolis-Hastings algorithm, 279, 291 
generalizations, 293-300 
minimal analysis, 217 
missing at random (MAR), 202, 450 
a more reasonable assumption than 
MCAR, 450 
a slightly misleading phrase, 202 
missing completely at random (MCAR), 
450 
missing data, 449-467 
and EM algorithm, 452, 454 
intentional, 198 
monotone, 453, 455, 459-462 
multinomial model, 462 
multivariate normal model, 454—456 
multivariate t model, 456 
notation, 199, 449-452 
paradigm for data collection, 199 
Slovenia survey, 463—466 
unintentional, 198, 204, 449 
mixed-effects model, 382 
mixture models, 17, 20, 105, 135, 519- 
543 
computation, 523-524 
continuous, 520 
de Finetti’s theorem and, 105 
Dirichlet process, 549-557 
discrete, 519 
exponential distributions, 486 
hierarchical, 525 
label switching, 533 
model checking, 531, 532 
prediction, 530 
schizophrenia example, 524-533 
mixture of exponentials, as example of 
an ill-posed system, 478, 486 
model, see also hierarchical models, re- 
gression models, etc. 
beta-binomial, 438 
binomial, 29, 37, 80, 147 
Cauchy, 59, 437 
Dirichlet process, 545 
exponential, 46, 60 
lognormal, 188 
multinomial, 69, 79, 423-428 
multivariate normal, 70 
negative binomial, 437, 446 
nonlinear, 471—486 
normal, 39, 41, 42, 60, 64-69 
overdispersed, 437—439 
Poisson, 43, 44, 59, 60 
Polya urn, 549 
robit, 438 
robust or nonrobust, 438-439 
t, 293, 437, 441-445 
underidentified, 89 
model averaging, 193, 297 
model building, one of the three steps of 
Bayesian data analysis, 3 
model checking, 141-164, 187-195 
adolescent smoking, 148 
adolescent smoking example, 150 
election forecasting, 386 
election forecasting example, 142 
incumbency example, 361 
one of the three steps of Bayesian 
data analysis, 3 
power transformation example, 189 
pre-election polling, 210 
psychology examples, 154-157 
residual plots, 158, 476, 484 
SAT coaching, 159-161 
schizophrenia example, 531, 532 
speed of light example, 143, 146 
spelling correction example, 10 
toxicology example, 483 
model comparison, 178-184 
model complexity, see effective number 
of parameters 
model expansion, 184-192 
continuous, 184, 372, 439 
schizophrenia example, 531-532 
model selection 
bias induced by, 181 
why we reluctantly do it, 178, 183- 
184, 367 
moment matching in expectation propa- 
gation, 339 
momentum distribution for Hamiltonian 
Monte Carlo, 301 
monitoring convergence of iterative sim- 
ulation, 281—286 
monotone missing data pattern, 453, 455, 
459-462 
Monte Carlo error, 267, 268, 272 
Monte Carlo simulation, 267-310 
multilevel, see hierarchical models 
multimodal posterior distribution, 299, 
319 
multinomial distribution, 580, 586 
multinomial logistic regression, 426 
multinomial model, 69, 79 
for missing data, 462 
multinomial probit model, 432 
multinomial regression, 408, 423-428 
parameterization as a Poisson re- 
gression, 427 
multiparameter models, 63-82 
multiple comparisons, 96, 134, 150, 186 
multiple imputation, 201, 451-454 
combining inferences, 453 
pre-election polling, 456-462 
Slovenia survey, 463—466 
multiple modes, 311, 321 
multivariate models 
for nonnormal data, 423-425 
hierarchical, 390-392 
prior distributions 
noninformative, 458 
multivariate normal distribution, 578, 582 
multivariate t distribution, 319, 580 


natural parameter for an exponential family, 36 
negative binomial distribution, 44, 132, 
580, 586 
as overdispersed alternative to Pois- 
son, 437, 446 


nested Dirichlet process (NDP), 566, 568 
neural networks, 485 
New York population, 188-191 
Newcomb’s speed of light experiment, 66, 
79 
Newton’s method for optimization, 312 
no interference between units, 200 
no U-turn sampler for Hamiltonian Monte 
Carlo, 304 
non-Bayesian methods, 92-97, 100 
difficulties for SAT coaching exper- 
iments, 119 
nonconjugate prior distributions, see prior 
distribution 
nonidentified parameters, 89 
nonignorable and known designs, 204 
nonignorable and unknown designs, 204 
noninformative prior distribution, 51-56 
binomial model, 37, 53 
difficulties, 54 
for hyperparameters, 108, 110, 111, 
115, 117, 526 
in Stan, 596 
Jeffreys’ rule, 52-53, 57, 59 
multivariate normal model, 73 
normal model, 64 
pivotal quantity, 53-54, 56 
nonlinear models, 471—486 
Gaussian processes, 501-518 
golf putting, 486, 499 
mixture of exponentials, 486 
serial dilution assay, 471-476 
splines, 487—499 
toxicology, 477-485 
nonparametric methods, 96 
nonparametric models, 501-518, 545-573 
nonparametric regression, 487—499 
nonrandomized studies, 220 
normal approximation, 83-87, 318-319 
bioassay experiment, 86 
for generalized linear models, 409 
lower-dimensional, 85 
meta-analysis example, 125 
multimodal, 319 
normal distribution, 577, 578 
normal model, 39, 41, 60, 64-69, see also 
linear regression and hierarchi- 
cal models 
multivariate, 70, 454—462 
power-transformed, 188-191, 194— 
195 
normalizing factors, 7, 345-349 
notation for data collection, 199 
notation for observed and missing data, 
199, 449, 452 
nuisance parameters, 63 
numerical integration, 271, 318-319, 345-348
Laplace’s method, 318, 348 
numerical posterior predictive checks, 143-152
NYPD stops example, 420-422 


objective assignment of probability distributions
football example, 13-16 
record linkage example, 16-19 
objectivity of Bayesian inference, 13, 24 
observational studies, 220-224 
difficulties with, 222 
distinguished from experiments, 220 
incumbency example, 358-364 
observed at random, 450 
observed data, see missing data 
observed information, 84 
odds ratio, 8, 80, 125 
offsets for generalized linear models, 407 
chess example, 428 
police example, 420 
optimization and the Metropolis algorithm, 
279 
ordered logit and probit models, 408, 426 
outcome variable, 353 
outliers, models for, 435 
output analysis for iterative simulation, 
281-288 
overdispersion, 407, 431, 433, 437-439
overfitting, 101, 367, 409 


p-values, see also model checking 
Bayesian (posterior predictive), 146 
classical, 98, 145 
interpretation of, 150 

packages, see software 

paired comparisons with ties, 432 
multinomial model for, 427 

parallel tempering for MCMC, 299-300 

parameter expansion 
election forecasting example, 393
for Anova computation, 396 
for EM algorithm, 325, 348 
for hierarchical regression, 393, 396 
programming in R, 601-603 
parameters, 4 
different from predictions, in frequentist inference, 94, 401
Pareto distribution, 493 
partial pooling, see shrinkage 
partially conjugate prior distribution, 115, 
322 
path sampling, 347-348 
perchloroethylene, 477 
perfect simulation for MCMC, 309 
permutation tests, 96 
personal (subjective) probability, 13, 256 
pharmacokinetics, 480-481 
philosophy, references to discussions of, 
26 
pivotal quantity, 53-54, 56, 66, 151 
point estimation, 85, 91, 99 
Poisson distribution, 580, 585 
Poisson model, 43, 59, 60 
parameterized in terms of rate and 
exposure, 44 
Poisson regression, 82, 406, 433 
for multinomial data, 426 
hierarchical, 420-422
police stops, example of hierarchical Poisson regression, 420-422
Polya urn model, 549 
pooling, partial, 25, 115 
population distribution, 101 
posterior distribution, 3, 7, 10 
as compromise, 32, 40, 58 
improper, 54, 90, 135 
joint, 63 
marginal, 63 
normal approximation, see normal 
approximation 
predictive, 7 
summaries of, 32 
use as prior distribution when new 
data arrive, 9, 251 
posterior intervals, 3, 33, 267 
posterior modes, 311-330 
approximate conditional posterior density using marginal modes, 325
conditional maximization (stepwise 
ascent), 312 
EM algorithm for marginal posterior modes, 320-325, 348
ECM/ECME algorithms, 323, 456, 
526 
examples, 322, 329, 444, 465, 526 
generalized EM algorithm, 321
marginal posterior density increases 
at each step, 329 
missing data, 452, 454 
SEM/SECM algorithms, 324-325, 
465 
joint mode, problems with, 350 
Newton’s method, 312 
posterior predictive checks, 143-161, see 
also model checking 
graphical, 153-159 
numerical, 143-152 
posterior predictive distribution, 7 
hierarchical models, 108, 118 
linear regression, 357 
missing data, 202 
mixture model, 530 
multivariate normal model, 72 
normal model, 66 
speed of light example, 144 
posterior simulation, 22-24, 267-310, see
also Markov chain Monte Carlo 
(MCMC) 
computation in R and Stan, 591-608
direct, 263-264 
grid approximation, 76-77, 263 
hierarchical models, 112 
how many draws are needed, 267, 
268, 272 
rejection sampling, 264 
simple problems, 78 
two-dimensional, 76, 82 
using inverse cdf, 23 
poststratification, 222, 422-423, 460
potential scale reduction factor, 285 
power transformations, 188-191, 194-195 
pre-election polling, 69, 79, 233-234, 422-423
in Slovenia, 463-466 
missing data, 456-466 
state-level opinions from national polls, 
422-423 
stratified sampling, 207-210 
precision (inverse of variance), 40 
prediction, see posterior predictive distribution
predictive simulation, 28 
predictor variables, see regression models, explanatory variables
predictors 
including even if not ‘statistically 
significant’, 241-244 
selecting, 186 
principal stratification, 223-224 
prior distribution, 6, 10 
boundary-avoiding, 313-318
conditionally conjugate, 129, 130, 
280, 315, 332, 503 
conjugate, 35-37, 56 
binomial model, 34-35, 38 
exponential model, 46 
generalized linear models, 409 
linear regression, 376-378 
multinomial model, 69, 429, 462 
multivariate normal model, 71, 72 
normal model, 39, 43, 67 
Poisson model, 44 
estimation from past data, 102 
for covariance matrices 
noninformative, 458 
hierarchical, see hierarchical models and hyperprior distribution
improper, 51, 82 
and Bayes factors, 194 
informative, 34-46, 480-481
nonconjugate, 36, 38, 75 
noninformative, 51-56, 93 
t model, 443 
binomial model, 37, 53 
difficulties, 54 
for hyperparameters, 108, 110, 111, 
115, 117, 526 
generalized linear models, 409 
in Stan, 592, 596 
Jeffreys’ rule, 52-53, 57, 59 
linear regression, 355 
multinomial model, 464 
multivariate normal model, 73 
normal model, 64 
pivotal quantity, 53-54, 56 
warnings, see posterior distribution, improper
partially conjugate, 115, 322 
predictive, 7 
proper, 51 
weakly informative, 55-57, 128-132, 
313-318 
in Stan, 596 
prior predictive checks, 162, 164 
prior predictive distribution, 7 
normal model, 41 
probability, 19-22, 26 
assignment, 13-19, 26, 27 
foundations, 11-13, 25 
notation, 6 
probability model, 3 
probit regression, 406 
for multinomial data, 426, 432 
Gibbs sampler, 408 
latent-data interpretation, 408 
probit transformation, 22 
programming tips, 270-271, 607-608

propensity scores, 204, 221, 222, 230 

proper prior distribution, see prior distribution

proportion of female births, 29, 37-39 

psychological data, 154-157, 524-533 

PX-EM algorithm, 325, 348, see also parameter expansion


QR decomposition, 356, 378 
quality-adjusted life expectancy, 245 
quasi-Newton optimization, 313 


R, see software 
R̂ for monitoring convergence of iterative
simulation, 285 
radial basis functions, 487 
radon decision problem, 194, 246-256, 
378 
random probability measure (RPM), 545-573
random-effects model, 382-388 
analysis of variance (Anova), 395 
and superpopulation in Anova, 397 
election forecasting example, 386 
non-nested example, 422-423 
several batches, 383 
randomization, 218-220 
and ignorability, 220, 230 
complete, 218 
given covariates, 219 
randomized blocks, 231 
rank test, 97 
rat tumors, 102-103, 109-113, 133 
ratio estimation, 93, 98 
record linkage, 16-19 
record-breaking data, 230 
reference prior distributions, see noninformative prior distribution
regeneration for MCMC, 309 
regression models, 353-380, see also linear regression
Bayesian justification, 354 
explanatory variables, 5, 200, 353, 
365-367 
exchangeability, 5 
exclude when irrelevant, 367 
ignorable models, 203 
goals of, 364-365 
hierarchical, 381-404
variable selection, 367 
why we prefer to use informative 
prior distributions, 367-369 
regression to the mean, 95 
regression trees, 485 
regularization, 51, 113-124, 368-369, 493 
rejection sampling, 264, 273 
picture of, 264
replications, 145 
residual plots, 162, 358 
binned, 157-158 
dilution example, 476 
incumbency example, 362 
nonlinear models, 476, 484 
pain relief example, 158 
toxicology example, 484 
residuals, 157 
response surface, 126 
response variable, 353 
reversible jump sampling for MCMC, 297-299, 309
ridge regression, 401 
robit regression (robust alternative to logit 
and probit), 438 
robust inference, 162, 185, 192, 435-447
for regression, 444-445
SAT coaching, 441-444
various estimands, 191 
rounded data, 80, 234 


sampling, 205-214, see also surveys 
capture-recapture, 233 
cluster, 210-212, 232 
poststratification, 222, 422-423
ratio estimation, 93, 98 
stratified, 206-210 
unequal selection probabilities, 212-214, 233-234
sampling distribution, 6, 35 
relevance to model checking, 152 
SAT coaching experiments, 119-124 
difficulties with natural non-Bayesian 
methods, 119 
information criteria and effective number of parameters, 179
model checking for, 159-161 
robust inference for, 441-444
scale parameter, 42 
scaled inverse-χ² distribution, 43, 578, 583
scaled inverse-Wishart model, 74, 390 
schizophrenia reaction times, example of 
mixture modeling, 524-533 
selection of predictors, 186 
SEM/SECM algorithms, 324-325, 348 
sensitivity analysis, 160-161, 184, 185, 
435-447 
and data collection, 191 
and realistic models, 191 
balanced and unbalanced data, 221 
cannot be avoided by setting up a 
super-model, 141 
estimating a population total, 188-191
incumbency example, 363 
SAT coaching, 441-444 
using t models, 443-444 
various estimands, 191 
sequential designs, 217, 235 
serial dilution assay, example of a nonlinear model, 471-476, 485
sex ratio, 29, 37-39 
shrinkage, 32, 40, 45, 113-124, 132, 368-369, 490, 493
graphs of, 113, 122 
simple random sampling, 205-206 
difficulties of estimating a population total, 188
simulated tempering for MCMC, 309 
simulation, see posterior simulation 
single-parameter models, 29-62 
SIR, see importance resampling 
slice sampling for MCMC, 297, 309 
Slovenia survey, 463-466
small-area estimation, 133 
software, 591-608 
Bugs, 27, 133, 269, 272 
debugging, 270-271, 607-608 
extended example using Stan and 
R, 592-607 
programming tips, 270-271, 607-608
R, 22, 27, 591-608 
running Stan from R, 591 
setting up, 591 
Stan, 22, 269, 307-308, 591-596 
speed of light example, 66, 143 
posterior predictive checks, 146 
spelling correction, simple example of Bayesian 
inference, 9-11 
splines, 487-499
gay marriage, 499 
golf putting, 499 
multivariate, 495-498
sports 
football, 13-16, 26 
golf, 486, 499, 517 
stability, 200 
stable estimation, 91 
stable unit treatment value assumption, 
200, 231 
Stan, 307-308, 591-596 
standard errors, 85 
state-level opinions from national polls, 
422-423 
statistical packages, see software 
statistically significant but not practically 
significant, 151 
regression example, 363 
stepwise ascent, 312 
stepwise regression, Bayesian interpretation of, 367
stratified sampling, 206-210 
hierarchical model, 209-210, 292 
pre-election polling, 207-210 
strong ignorability, 203 
Student-t model, see t model 
subjectivity, 12, 13, 26, 28, 100, 248, 256 
sufficient statistics, 36, 93, 338 
summary statistics, 85 
superpopulation inference, 200-203, 205-206, 208, 209, 212, 214-216, 232
in Anova, 396-397 
supplemented EM (SEM) algorithm, 324-325
survey incentives, example of meta-analysis 
and decision analysis, 239-244 
surveys, 205-214, 454-466, see also sampling
adolescent smoking, 148-150 
Alcoholics Anonymous, 213-214 
incentives to increase response rates, 
239-244 
pre-election polling, 207-210, 422-423, 456-466
telephone, unequal sampling probabilities, 233-234


t approximation, 319 
t distribution, 66, 580, 584 
t model, 437, 441-445 
computation using data augmentation, 293-294
computation using parameter expansion, 295
interpretation as mixture, 437 
tail-area probabilities, see p-values 
target distribution, 261 
test statistics and test quantities, 145 
choosing, 147 
examples, see model checking 
graphical, 153-159 
numerical, 143-152 
thinning of MCMC sequences, 282 
three steps of Bayesian data analysis, 3 
tilted distribution in expectation propagation, 339
toxicology model, as example of an ill-posed system, 477-485
trans-dimensional MCMC, 297-299, 309 
transformations, 21, 99 
examples where not needed, 241, 360 
logarithmic, 380 
logistic (logit, log-odds), 22, 125 
power, 188-191, 194-195 
probit, 22
rat tumor example, 110 
to improve MCMC efficiency, 293-295
to reduce correlations in hierarchical models, 480
useful in setting up a multivariate 
model, 424 
treatment variable, 353 
truncated data, 224-228 
2 x 2 tables, 80, 125, 423-425 
type I errors, why we do not care about, 
150 


U.S. House of Representatives, 358 

unbiasedness, see bias 

unbounded likelihoods, 90 

underidentified models, 89 

uniform distribution, 577, 578 

units, 353 

unnormalized densities, 7, 261 

unseen species, estimating the number 
of, 349 

utility in decision analysis, 238, 245, 248, 
256 


variable selection, why we prefer to use 
informative priors, 367 
variance matrix, see covariance matrix 
variational inference, 331-338 
EM as special case, 337 
hierarchical model example, 332-335 
model checking for, 336 
picture of, 334, 335, 342 
variational lower bound, 336 
varying intercepts and slopes, 390-392 
vector and matrix notation, 4 


warm-up for MCMC sequences, 282 
Watanabe-Akaike or widely applicable information criterion (WAIC), 173-174, 177
discussion, 182 
educational testing example, 179 
weakly informative prior distribution, 55-57, 128-132, 313-318
in Stan, 596 
Weibull distribution, 578, 583 
Wilcoxon rank test, 97 
Wishart distribution, 578, 584 


y^rep, 145
ỹ, 4, 7, 145


